Okay. Hi, everyone. Um, so let's get started again today.
So today's lecture what I'm going to do,
is be talking about, um,
question answering over text.
Um, this is another of the big successes
in using deep learning inside natural language processing,
and it's also a technology that has some really obvious commercial uses.
So it's an, it's an area that has attracted
a lot of attention in the last couple of years.
So this is the overall plan.
Um, just a couple of reminders and things at the beginning about final project stuff,
and then we'll, basically all of it is talking about question-answering starting with, um,
motivation history, um, talking about the SQuAD data,
uh, a particular simple model, our Stanford Attentive Reader.
Then talking about some other more complex,
um, stuff into the most modern stuff.
Um, yeah, so in a sense, this
lecture serves a double purpose, because if you're going to do
the default final project, well,
it's about textual question-answering,
and this is your chance to learn something about the area of textual question-answering,
and the kinds of models you might want to be thinking about and building.
Um, but the content of this lecture is pretty much in
no way specifically tied to the default final project,
apart from the subject matter; really it's telling you about
how people use neural nets to build question-answering systems.
Okay. So first just quickly on the reminders,
um, mid-quarter survey.
I mean, a huge number of people,
um, have actually filled this in already.
Uh, we already had over a 60 percent, um,
fill-in rate, which by the standards of people
who do surveys counts as a huge success already.
But if you're not in that 60 percent, um,
we'd still love to have your feedback and now's the perfect time to do it.
Um, yeah.
I just wanted to sort of have a note on custom final projects.
Um, so in general, um,
it's great to get feedback on custom final projects.
There's a formal mechanism for that which is
the project proposal that I mentioned last time.
It's also great to chat to people,
um, informally about, um, final projects.
And so I'm one of those people and I have
been talking to lots of people about final projects,
and, uh, very happy to do so.
But there's sort of a problem that there's only one of me.
Um, so I do also, um,
encourage you to realize that among the various TAs that
really lots of them have had experience of different deep learning projects,
and in particular on the office hours page,
there's a table that's like this but you can read it if you look at it on your own laptop,
which talks about the experience of different TAs.
And many of them have experience in different areas,
and many of them are also good people to talk to about final projects.
Okay. Um, so for the default final project, the textual question-answering.
So um, draft materials for that are up today,
um, right now on the website actually.
Um, we're calling them draft because we think that there are still
probably a few things that are gonna get changed over the next week,
so um, don't regard it as completely final in terms of the code;
you know, it's sort of 90 percent final.
So in terms of deciding whether you're going to do, um,
a custom final project or a default final project,
and working out what you're putting into your project proposal,
um, it should be, you know,
well more than, um,
what you need for this year.
Okay. The one other, um,
final bit I just wanted to say that I didn't get to
last time is so for the final projects,
regardless of which kind you're doing,
um, well, part of it is, um,
doing some experiments, of
doing stuff with data and code,
and getting some numbers and things like that.
But I do really, um,
encourage people to also remember that an important part of
the final project is writing a final project report.
And this is no different to any research project of the kinds that,
um, students do for conferences or journals and things like that, right?
You spend months commonly working over your code and experiments.
But in most cases,
the main evaluation of your work is from people reading,
a written paper output version of things.
So it's really important that,
that paper version sort of reflects the work
that you did and the interesting ideas that you came up with,
and explains them well and presents your experiments,
and all of those things.
And so we encourage you to sort of do a good job at writing up your projects.
Um, here is just sort of a vague outline of, you know,
what a typical project write-up is likely to look like.
Now, there isn't really one size completely fits all
because depending on what you've done different things might be appropriate.
But, you know, typically the first page,
you'll have an abstract for the paper and the introduction to the paper.
You'll spend some time talking about related prior work.
Um, you'll talk about what kind of models you built for a while.
Um, there's probably some discussion of what data you are using for your projects.
Um, experiments commonly with some tables and figures about the things that you're doing.
Um, more tables and figures talking about the results as to how well your systems work.
Um, it's great to have some error analysis to see
what kind of things that you got right and wrong,
and then maybe at the end there's sort of
plans for the future, conclusions, or something like that.
Okay. Um, that's sort of it for my extra administrative reminders.
Um, are there any questions on final projects that people are dying to know?
[NOISE] Okay. Good luck.
I just meant to say good luck.
Yeah. Good luck with your final projects. [LAUGHTER] Okay.
So now moving into,
um, yeah, the question answering.
Okay. So, I mean- so question answering is
a very direct application for something that human beings,
um, want to do.
Um, well, maybe human beings don't in general want to know this.
Um, here's my query of "Who was Australia's third prime minister?".
Um, maybe, yeah, that's not really the kind of
thing you're gonna put into your queries but,
you know, maybe you query,
"Who was the lead singer of Big Thief?"
or something like that. I don't know.
Um, you're, uh, but you know,
lots- a large percentage of stuff [NOISE] on the web
is people actually asking for answers to questions.
And so, if I put in this query into Google,
it actually just works.
It tells me the answer is John Christian Watson.
And, um, so that's sort of question answering working in the real world.
Um, if you try different kinds of questions in Google,
you'll find that some of them work and lots of them don't work.
And when they don't work,
you're just sort of getting whatever kind of information retrieval, web search results.
Um, there is one fine point that I just wanted,
um, to mention down here.
So another thing that Google has is the Google Knowledge Graph,
which is a structured graph representation of knowledge.
And some kinds of questions,
um, are being answered from that structured knowledge representation.
And so, I mean,
quite a lot of the time for things like movies,
it's coming from that structured graph.
If you're sort of saying, "Who's the director of a movie?"
or something like that.
But this answer isn't coming from that.
This answer is a genuine example of
the kind of stuff we're gonna talk about today.
It's textual question answering from a web page where
Google's question answering system has
extracted the answer and is sticking it up there.
Um, if you're, um,
wanting to explore these things, um,
if you get one of these boxes sort of down here where I've cut it off,
there's a little bit of gray that says,
"How did I get this result?".
And if you click on that,
it actually tells you what source it's getting it from and you can see if it's doing it
from the textual question answering system or from something like the Knowledge Graph.
Okay. Um, so the- in general,
the motivation for question answering is that these days there's
just these sort of massive collections of full text documents,
i.e., there's the web.
Um, so that there are sort of billions of documents of information.
And traditionally, when people first started
thinking about search information retrieval as a field,
you know, nothing of that kind of quantity and size existed, right?
That when people first started building search systems,
it was sort of unthinkable to index
whole documents because no one had hard disks big enough in those days, right?
That really- they were indexing titles or titles and abstracts or something like that.
And so, it seemed perfectly adequate in those days to say, "Okay.
We're just gonna send you- give you your results."
as to "Here's a list of documents."
because the documents are only a hundred words long.
But that's clearly not the case now when we have the sort of, you know,
ten-minute-read Medium posts, um, which might have the answer to a question.
And so, there's this need to sort of say, "Well,
can we just have systems that will give us answers to questions?".
And a lot of the recent changes in technology have hugely underlined that need.
So, returning documents works okay if you're sitting at your laptop,
but it works really terribly if you're on your phone and it works even more
terribly if you're trying to work with speech on a digital assistant device,
something like an Alexa system.
And so, we really want to actually be able to produce
systems that can give the answers to people's questions.
And so typically, doing that is factored into two parts.
That the first part of that is we still do information retrieval.
We use, um, normally quite standard information retrieval techniques to
find documents that are quite likely to contain an answer.
And the reason that this is normally done by quite traditional techniques is because
the traditional techniques are extremely scalable over billions of documents,
whereas current neural systems actually
aren't really scalable over billions of documents.
But that's an area in sort of which research is ongoing.
But then once we have sort of some candidate likely documents,
we want to find, uh,
do they contain an answer,
and if so, what is the answer?
And so at that point,
we have a document or a paragraph,
and we're saying, "Can we answer this question from there?"
And then that problem is often referred to as the Reading Comprehension problem.
And so that's really what I'm gonna focus on today.
Um, Reading Comprehension isn't a new problem.
I mean it- you can trace it back into the early days of artificial intelligence and NLP.
So, back in the 70's,
a lot of NLP work was trying to do Reading Comprehension.
I mean one of the famous strands of that, um, was, um,
Roger Schank, who was a famous,
um, early NLP person.
Though not a terribly nice man,
I don't think, actually.
Um, but the Yale School of AI was a very well-known,
um, NLP approach and really,
it was very focused on Reading Comprehension.
Um, but it's sort of,
you know, I think it was sort of, at the time, it was too early anyway.
It sort of died out. Nothing much came out of that.
Um, but then, right just before the turn of the millennium,
Lynette Hirschman revived this idea and said, "Well,
maybe a good challenge would be to find the kind of
Reading Comprehension questions that elementary school kids do,
and let's see if we could get,
um, computers to do that."
And some people tried that with fairly simple methods,
which only worked mediocrely.
Then sort of somewhat after that, um,
Chris Burges, who was a guy at
Microsoft Research, and he wasn't really an NLP person at all.
He was a machine learning person,
but he got it into his head, um,
that, well, really a big problem that should be worked on is
Machine Comprehension and he suggested that you sort of could codify it like this.
And this is a particular clean codification
that has lived on and we'll look at more today.
All right. So, a machine comprehends a passage of text if,
for any question regarding that text that can be
answered correctly by a majority of native speakers,
that machine can provide a string,
which those speakers would agree both answers
that question and does not contain information irrelevant to that question.
Um, and he sort of proposed this as sort of a challenge problem for
artificial intelligence and set about collecting a corpus,
the MCTest corpus, which was meant to be a simple Reading Comprehension challenge.
Um, so they collected, um,
stories, um, which, um,
were meant to be kids' stories, you know.
"Alyssa got to the beach after a long trip.
She's from Charlotte. She traveled from Atlanta.
She's now in Miami".
Sort of pretty easy stuff.
And then there were questions.
"Why did Alyssa go to Miami?"
Um, and then the answer is,
"To visit some friends".
And so you've got there this string that is coming from the passage.
That's the answer to the question.
Um, so the MCTest is a corpus of
about 600 such stories and that challenge existed, and a few people worked on it.
But that never really went very far either for the next couple of years.
But what really changed things was that in 2015,
and then with more stuff in 2016,
um, deep learning people got interested in this idea of,
"Could we perhaps build neural question answering systems?"
And it seemed like if you wanted to do that, um,
something like MCTest could only be a test set
and the ways to make progress would be to do what had been done
in other domains and to actually build just- hand build a large training set of passages,
questions, and answers in such a way that would be able to train neural networks using
the kind of supervised learning techniques that we've
concentrated on so far in this class.
And indeed, the kind of supervised neural network learning techniques,
which is [NOISE] actually the successful stuff that
powers nearly all the applications of deep learning,
not only in NLP,
but also in other fields like vision.
Um, and so the first subs- the first such dataset was built by
people at DeepMind over CNN and Daily Mail news stories.
Um, but then the next year, um,
Pranav Rajpurkar is a Stanford PhD student
working with Percy Liang and a couple of other students, um,
produced the SQuAD dataset,
which was actually a much better designed dataset and proved to be
sort of much more successful at driving this forward.
And then following along from that,
other people started to produce lots of other,
um, question answering datasets which, you know,
many of them have interesting advantages
and disadvantages of their own including MS MARCO,
TriviaQA, RACE, blah, blah, blah, lots of them.
Um, but for today's class,
I'm gonna concentrate on SQuAD,
because SQuAD is actually the one that has been by far the most widely used.
And because it - it was just a well-constructed clean dataset,
that it sort of just proved a profitable one for people to work with.
[NOISE]
Okay. Um, so, that was reading comprehension.
I'll also just quickly tell you the, um,
the history of open domain question answering.
So, the difference here for the- the field of
Open-domain Question Answering that we're saying, okay,
there's an encyclopedia or there's a web crawl,
I'm just going to ask a question,
can you answer it?
So, it's this bigger task of question answering.
And, you know, that was something that again was thought about,
um, very early on.
So, there's this kind of early, um,
CACM paper by Simmons who sort of explores how you could
do answering questions as textual question-answering, um, and yet, you know,
he has the idea that what's going to
happen is you're gonna dependency parse the question,
and dependency parse sentences of the text,
and then sort of do tree matching over the dependency parses,
um, to get out the answers.
And, you know, that's in some sense
actually prefigured work that people actually were then attempting to do 35 years later.
Um, getting a bit more modern, um, Julian Kupiec,
who was working at Xerox PARC at the time,
um, came up with this system called MURAX,
and so at this stage in the 90s there started to be the first, um,
digitally available encyclopedias,
so he was using Grolier's Encyclopedia,
and so he set about trying to build a system that could answer
questions over that encyclopedia using,
in general, fairly sort of shallow, um,
linguistic processing methods, i.e., regular expressions,
um, after [LAUGHTER] having, um,
done information retrieval search over that.
But that started to evoke more interest from other people,
and so in 1999 the US National Institute of Standards and Technology, um,
instituted a TREC question-answering track where the idea was,
there was a large collection of newswire documents,
and you would be asked questions and had to provide answers from them,
and lots of people started to build question answering systems.
Indeed, in some sense it was
this competition where people at IBM started,
um, working on textual question-answering,
and then, um, sort of a decade later, um,
IBM rejigged things into the sexier format of,
um, let's build a Jeopardy contestant rather than let's answer questions from the news,
and that then led to their DeepQA system in 2011.
Which I presume quite a few of you saw;
did people here see the IBM Jeopardy matches?
Yeah, some of you.
Okay. So, that they were able to successfully, um,
build a question answering system that could compete at Jeopardy, um, and win.
Um, and, you know, like a lot of these demonstrations of
technological success, there are things you can quibble about in the way it was set up,
um, that really the computer just had
a speed advantage versus the human beings, who had to buzz in to answer the question.
But, you know, nevertheless, fundamentally,
the textual question-answering had to work,
that this was a system that was answering questions mainly based on textual passages,
and it had to be able to find the answers to those questions correctly,
for the system to work.
Um, so then, more recently again, um,
and really the first piece of work that did this with a neural system was,
um, work that was, um,
done by a Stanford PhD student,
that I'll get to later,
and the idea was, well,
could we replace traditional complex question answering systems
by using a neural reading comprehension system,
and that's proved to be very successful.
So, to, to explain that a little bit more, um,
if you look at the kind of systems that were built for TREC question-answering,
um, they were very complex multi-part systems.
And really, if you then look at something like,
IBM's Deep QA system it was sort of like this
times 10 because it both had very complex systems like this,
but it ensembled together sort of six different components in every place,
and then did sort of,
um, classify a combination on top of them.
This is sort of around a 2003 question answering system,
and so the kind of things it went through are:
so when there was a question,
it parsed the question with a parser
kind of like the ones we saw with our dependency parsers.
It did some sort of
handwritten semantic normalization rules to try and get them into a better semantic form.
It then had a question type classifier which tried to
work out what kind of semantic type is this question looking for,
is it looking for a person name,
or a country name,
or a temperature, or something like that.
Um, it would, um, then, um,
have an information retrieval system out of the document collection,
um, which would find paragraphs that were likely to contain the answers.
Um, and then it would have a method of ranking
those paragraph choices to see which ones are likely to have the answers.
Um, it would then,
um, over there somewhere, um,
run Named Entity Recognition on those passages to find entities that were in them.
These systems depended strongly on
finding named entities, because then the system could look for
an entity which corresponded to the question type.
Um, then once it had candidate entities,
it had to actually try and determine whether
these entities did or didn't answer the question.
So, these people, this is the system from LCC by,
um, Sanda Harabagiu and Dan Moldovan.
They actually had some quite interesting stuff here,
where they had a kind of a loose theorem prover that would try and prove that, um,
the semantic form of a piece of text,
um, gave an answer to what the question was.
So, you know, that was kind of cool stuff with an Axiomatic Knowledge Base,
um, and eventually out would come an answer.
Um, so, you know, something that
I do just want to emphasize, you know,
sometimes with these deep learning courses you get these days,
the impression you get is that absolutely nothing worked before 2014,
uh, when we got back to deep learning,
and that's not actually true.
So, these kind of factoid question answering systems within
a certain domain actually really worked rather well.
Um, so, I started saying the term Factoid Question Answering,
and so let me explain that because that's the secret.
So, people, at least in NLP,
use the term "Factoid Question Answering" to mean
the case that your answer is a named entity.
So, it's sort of something like, you know,
what year was Elvis Presley born,
or what is the name of Beyonce's husband, or, um,
you know, which state,
um, has the most pork or something, I don't know.
Right, anything where
the answer is sort of some clear semantic type of entity,
and that's your answer.
I mean, so, within the space of those kind of questions,
which actually is a significant part of the questions you get in web search, right?
Lots of web search is just, you know,
who was the star of this movie,
or what year was somebody born, right?
There's zillions of those all the time.
These systems actually really did work quite well
that they could get about 70 percent of those questions right,
um, which wasn't bad at all, um,
though that they really sort of didn't really
extend it out to other kinds of stuff beyond that.
But whatever virtues they had, um,
they were extremely complex systems that people spent years putting together,
which had many components with a huge amount of hand-built stuff.
And most of the stuff was sort of built quite separately and tied together,
and you just sort of hope that it worked,
um, well, when put together in composite.
And so we can contrast that to what we then see later,
um, for neural network-style systems.
Okay. Um, so let me now say some more stuff about, um,
the Stanford Question Answering Dataset or SQuAD that I just mentioned a little bit ago,
and as this is the data for the default final project as well.
Um, so what SQuAD has is,
questions in SQuAD have a passage,
which is a paragraph from Wikipedia.
And then there is a question,
here it's, "Which team won Super Bowl 50?"
And the goal of the system is to come up with the answer to this question.
Um, human reading comprehension.
What is the answer to the question?
[NOISE]
Broncos.
Broncos. [LAUGHTER] Okay.
Yeah. Um, so that's the answer to the question.
Um, and so by construction for SQuAD,
the answer to a question is always a sub-sequence of words from the passage which is,
normally, it ends up being referred to as a span,
a sub-sequence of words from the passage.
So that's the only kind of questions you can have.
You can't have questions that are counting questions,
or yes, no questions, or anything like that.
You can just pick out a sub-sequence.
Um, okay.
But, um, so they created in the first version about 100,000 examples.
So there are a bunch of questions about each passage.
So it's sort of something like, um,
I think it's maybe about five questions per passage,
and there are 20,000 different bits of Wikipedia used.
Um, and this sort of must-be-a-span form
is often referred to as extractive question answering.
Okay. Um, here's just one more example
that can give you some more sense of some of the things that are there,
and it illustrates a couple of other factors.
Um, so, you know,
even this one- I guess the previous one wasn't, um,
completely obvious as to what your answer should be, because
maybe you could say the answer should just have been Broncos,
or you could have said it was Denver Broncos.
Um, and in general,
even if you're answering with a span,
there's gonna be variation as to how long a span you choose.
Um, so what they did, um,
and this was done on Mechanical Turk,
gathering the data, building questions,
and getting answers, is that they got answers from three different people.
So here's this question,
"Along with non-governmental and non-state schools,
what is another name for private schools?"
And three human beings were asked the answer based on this passage.
And one said independent,
and two said independent schools.
Um, this one, all three people gave the same answer.
This one, again, you get two different answers,
so that they sample three answers.
And basically, then, you can be correct if you're going with any of the answers.
And so that sort of at least gives you a bit of robustness to variation in human answers.
Okay. And that starts me into the topic of evaluation.
Um, yeah.
And these slides here are entitled
SQuAD version 1.1 because that means in five minutes time,
I'm gonna tell you about SQuAD version 2,
which adds a bit more stuff into it,
but we'll just get 1.1 straight first.
All right. So there are three answers that were collected.
And so for evaluation metrics,
they suggested two evaluation metrics.
The first one is exact match.
So you're going to return a span.
If the span is one of these three,
you get one point,
and if the
span is not one of these three,
you get zero for that question.
And then your accuracy is just the percent correct,
so that's extremely simple.
But the second metric, and actually,
the one that was favored as the primary metric,
was an F1 metric.
So what you do for this F1 metric
is you're matching at the word level for the different answers.
So you treat,
you treat the system span and each gold answer as a bag of words,
and then you work out a precision, which is,
um, the percent of words in the system's answer that are actually in
a gold span, and the recall,
which is the percent of words in a gold span that are in the system's span.
And then you calculate the harmonic mean of those two numbers,
and the harmonic mean is sort of a very conservative average,
so it's close to the min of those two numbers,
and that gives you a score.
And what you then do is, for each question,
you say its score is
the maximum F1 over the three different answers that were collected from human beings.
And then for the whole, um, dataset,
you then average those F1 scores across questions and that's then your final F1 result.
So that's a more complicated thing to say.
Um, and we provide, um, the eval code
for you that does that.
Um, but it sort of seems that F1 is actually
a more reliable and better measure because if you use exact match,
you know, even though there's of,
a bit of robustness that comes on three people's answers,
three is not a very large sample,
so there's sort of a bit of guessing as to whether you get
exactly the same span some human being got,
whereas you're sort of going to get a reasonable score
in the F1 even if your boundaries are off by a little.
So the F1 metric sort of, um,
is more reliable and avoids various kinds of artifacts as to how big
or small an answer human beings tend to choose in some circumstances.
Um, and so that's sort of being used as
the primary metric that people are scored on in the leaderboards.
Um, final detail, both metrics, um,
ignore punctuation and the English articles a, an, the.
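To make those metrics concrete, here is a minimal Python sketch of exact match and the max-over-gold-answers F1. The normalization follows the description above (lowercasing, stripping punctuation and the articles a, an, the), but the function names are just illustrative; this is not the official SQuAD evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and the articles a/an/the, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def f1(prediction, gold_answers):
    """Max over the gold answers of the bag-of-words F1 between prediction and gold."""
    def single_f1(pred, gold):
        pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
        common = Counter(pred_toks) & Counter(gold_toks)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_toks)   # words in the system span that are in the gold span
        recall = num_same / len(gold_toks)      # words in the gold span that are in the system span
        return 2 * precision * recall / (precision + recall)  # harmonic mean
    return max(single_f1(prediction, g) for g in gold_answers)

# Per-question scores are then averaged over the whole dataset.
print(f1("independent schools", ["independent", "independent schools", "independent schools"]))
```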
Okay. Um, so how did things work out?
Um, so for SQuAD version 1.1, um.
A long time ago,
at the end of 2016,
um, this is how the leaderboard looked.
Um, this is the bottom of the leaderboard at this point in
time because that allows me to show you a couple of things.
So down at the bottom of the leaderboard, um,
so they tested how well human beings did, um,
at answering these questions because you know,
human beings aren't perfect at answering questions either.
Um, and so the human performance that they measured,
um, had an F1 score of 91.2.
And I'll come back to that again in a minute.
Um, and so when they built the dataset,
they built a logistic regression baseline which was sort of a conventional NLP system.
So, they dependency parsed the question and sentences of the answer.
They looked for dependency.
So dependency link matches,
so a word at both ends with the dependency relation in
between and count and matches of those and sort of pointing to a likely answer.
Um, so it's sort of a fairly competently built traditional NLP system; it's
not as complex as, but it's sort of in
the same vein as, that early question answering system I mentioned.
And it got an F1 of about 51.
So not hopeless, um,
but not that great compared to human beings.
And so, very shortly after that, um,
people then started building
neural network systems to try and do better at this task on this dataset.
And so, one of the first people to do this quite successfully,
um, were these people from Singapore Management University,
maybe not the first place you would have thought of but, um,
they were really sort of the first people who showed that, yes,
you could build an end-to-end trained neural network
for this task and do rather better.
And so, they got up to 67 F1.
Um, and well, then they had a second system.
They got 70 and then things started,
um, to, um, go on.
So that even by,
um, the end of 2016,
um, there started to be systems that really worked rather well on this task.
Um, so here, this time was the,
um, top of the leaderboard.
So I'll talk later about this BiDAF system from, uh,
AI2, the
Allen Institute for Artificial Intelligence, and the University of Washington.
So, it was getting to 77 as
a single system. Like in just about all machine learning,
people pretty soon noticed that if you made
an ensemble of identically structured systems,
you could push the number higher, and so if you ensembled those,
you could then get another sort of whatever it is about four points
and get up to 81, um, F1.
And so this was sort of around the situation when, in the, uh, 2017, um,
224N class, we first used SQuAD version one as the default final project.
And at that point, you know,
actually the best students got almost to the top of this leaderboard.
So our best, um,
CS224N Final Project in winter 2017 made it into,
um, the equivalent of fourth place on this leaderboard,
um, with 77.5 as their score.
So that was really rather cool.
Um, but that's a couple of years ago and since then,
people have started building, um,
bigger and bigger and more and more complex, um, systems.
And, um, so essentially,
you could sort of say that SQuAD version one is basically solved.
So the very best systems are now getting
F1 scores that are in the low 90s and in particular,
you can see that the best couple of, um,
systems have higher F1s and
well higher exact matches than what was measured for human beings.
Uh, but like a lot of the claims of
deep learning being better and performing better
than human beings, there's sort of some asterisks you can put after that.
I mean, in particular for this dataset,
the way they measured human performance was a little bit
unfair because they only actually collected three human beings' answers.
So, to judge, um, the human performance,
each of those humans was being scored versus only two other humans.
And so, that means you only had two chances to match instead of three.
So, there's actually sort of a systematic underscoring of the human performance.
But whatever, systems got very good at doing this.
Um, so the next step, um,
was then to introduce, uh,
the SQuAD version 2 task.
And so many people felt that a defect of SQuAD version
1 was that in all cases, questions had answers.
So, you just had to find the answer in the paragraph,
um, and so that sort of turned into a kind of a ranking task.
You just had to work out what seemed the most likely answer
and return that, without really having
any idea whether it was an answer to the question or not.
And so, for SQuAD version two,
for the dev and test sets,
half of the questions have answers and half of
the questions just don't have an answer in the passage,
um, and it's a slightly different distribution in the training data.
Um, and the way it works for scoring is, sort of, like,
the no answer kind of counts as like one word, as a sort of a special token.
So, if it should be a no answer and you say no answer,
you get a score of one on either the exact match or the F-measure.
And if you don't do that,
you get a score of zero.
Um, and so, the simplest way of approaching SQuAD 2.0 would be to say, well,
rather than just always returning the best match in my system,
I'll use some kind of threshold, and only if the score is above the threshold
do I count it as an answer.
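As a rough sketch of that simplest thresholding approach (the function name and the threshold are made up here for illustration; in practice the threshold would be tuned on the dev set):

```python
def answer_or_abstain(best_span, best_span_score, threshold):
    """Return the best-matching span only if its score clears the threshold;
    otherwise predict no-answer (conventionally an empty string)."""
    return best_span if best_span_score >= threshold else ""
```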
You could do more sophisticated things.
So another area that we've worked on quite a bit at Stanford is
this natural language inference task that I'll talk about later in the course.
Um, but that's really about saying whether one piece of,
um, text is the conclusion of another,
um, piece of text.
And so that's sort of a way that you can try and see whether, uh,
a piece of text actually gives you a justification and answer to what the question was.
But at any rate, this trying to decide whether
you've actually got an answer or not is a quite difficult problem in many cases.
So here's an example from SQuAD, um, 2.0.
So Genghis Khan united the Mongol and Turkic tribes of
the steppes and became Great Khan in 1206.
He and his successors expanded the Mongol Empire across Asia,
blah, blah, blah, blah.
And the question is,
when did Genghis Khan kill Great Khan?
And the answer to that is,
you know, uh, there isn't an answer because actually,
Genghis Khan was a person named Great Khan and he didn't kill a Great Khan.
It's just not a question with an answer.
Um, but this is precisely what happens with these systems: you know,
even though these systems get high scores in terms of points,
they don't actually understand human language that well.
So they look at something that says,
when did Genghis Khan kill Great Khan?
Well, this is something that's looking for a date and there are
some obvious dates in this passage there's 1206, 1234,
1251 and well, there's kill,
and kill looks a little bit similar to destroyed.
I can see the word destroyed.
So that probably kind of matches.
And then we're talking about, um,
Genghis Khan and there,
I can see Genghis and Khan in this passage.
And so it sort of puts that together and says
1234 is the answer when that isn't the answer at all.
And that's actually kind of pretty typical of the behavior of these systems.
And so that, on the one hand, they work great.
On the other hand, they don't actually understand that much,
and effectively asking whether
this question is actually answered in the passage is a way of
revealing the extent to which these models
do or don't understand what's actually going on.
Okay. So, at the time, um,
they built SQuAD version 2.0,
they took some of, um,
the existing SQuAD version one systems,
and, um, modified them in a very simple way.
They put in a threshold, um,
score as to how good the final match was deemed to be,
and said, Well, how well do you do on SQuAD 2.0?
And the kind of systems that we saw doing well before,
now didn't do that well,
so something like the BiDAF system that we mentioned before was now scoring about 62 F1,
so that was sort of hugely lowering
its performance and reflecting the limits of understanding.
Um, but it turned out actually that this problem didn't prove to
be quite as difficult as the dataset authors,
um, maybe thought either.
Um, because it turns out that um,
here we are now in February 2019,
and if you look at the top of the leaderboard,
we're kind of getting close again to the point
where the best systems are almost as good as human beings.
So, um, the current top-rated system there, you can see, is getting 87.6 F1,
which is less than two points behind where the human beings are.
Um, for SQuAD version 2 they also corrected the,
um, scoring of human beings,
so it's more of a fair evaluation this time, um,
so there's still a bit of a gap but, you know,
the systems are actually doing, um, really well.
And the interesting thing there is,
you know, on the one hand these systems are impressively good.
Um, you can go on the SQuAD website and look
at the output of several of the good systems,
and you can see that there are just a ton of things that they get right.
They're absolutely not bad systems.
You have to be a good system to be getting five out of six of the questions right.
Um, but, you know, on the other hand they still
make quite elementary Natural Language Understanding Errors.
And so here's an example of one of those.
Okay, so this one,
the Yuan dynasty is considered both a successor to
the Mongol Empire and an imperial Chinese dynasty.
It was the khanate ruled by the successors of
Mongke Khan after the division of the Mongol Empire.
In official Chinese histories the Yuan dynasty bore the Mandate of Heaven,
following the Song dynasty and preceding the Ming dynasty.
Okay. And then the question is,
what dynasty came before the Yuan?
And that's a pretty easy question,
I'd hope, for a human being.
Everyone can answer that question?
Okay, um, yeah, so it says in official Chinese histories the Yuan dynast- uh,
sorry, the next sentence.
Um, yeah, right, the Yuan dynasty following
the Song dynasty and preceding the Ming dynasty.
But, you know, actually, um,
the sort of leading, um,
Google BERT model says that it was the Ming dynasty that came before
the Yuan dynasty, which, you know, is sort of elementarily
wrong, and that reveals some of the same kind of thing: it's
not really understanding everything, but it's doing a sort of a matching problem still.
Okay. So, this SQuAD dataset has been useful and good.
It still has some major limitations, and I just thought I'd
mention what a few of those are so you're aware of some of the issues.
So one of them I've already mentioned, right,
that you're in this space where all answers are a span from the passage.
And that just limits the kind of questions you can
ask and the kind of difficult situations there can be.
So, there can't be yes-no questions, counting
questions, or even any of the sort of more difficult implicit questions.
So, if you think back to when you were in middle school and did reading comprehension,
I mean, it wasn't typically um,
the case um, that you're being asked
questions that were just stated explicitly in the text of,
you know, Sue is visiting her mother in Miami.
And the question was,
who was visiting in Miami?
That wasn't the kind of question you were asked; you were normally asked questions, um,
like um, you know,
um, Sue is going to a job interview this morning,
um, it's a really important job interview for her future.
At breakfast she um,
starts buttering both sides of her piece of toast um,
and you are asked a question like, um,
why um, is Sue buttering both sides of her piece of toast?
And you're meant to be able to answer,
"She's distracted by her important job interview coming up later in the day."
Which isn't the- something that you can answer um,
by just picking out a sub span.
Um, a second problem, which is actually sort of a bigger problem, is, um,
the way SQuAD was constructed, for ease
and not to be too expensive and various other reasons, was, um,
paragraphs of Wikipedia were selected and then
Mechanical Turkers were hired to say,
"Come up with some questions, um,
that can be answered by this passage" in version 1.1.
And then in version two they were told,
"Also come up with some questions that
look like they're related to this passage but aren't actually answered in the passage."
But, in all cases, people were coming up with
the questions staring at the passage, and if you do that,
it means that your questions are strongly
overlapping with the passage, both in terms of the
words that are used, and even the syntactic structures
used for your questions tend to match the syntactic structures of the passage.
And so that makes question answering um, naturally easy.
What happens in the real world
is that human beings think up questions and
type something into a search engine and the way
that they type it in is completely distinct
from the way something might be worded on a website.
So that they might be saying something like,
you know, "In what year did the price of hard disks drop below a dollar a megabyte?"
Um, and the webpage will say something like,
the cost of hard disks has been dropping for many years, um,
and in, I don't know, whenever it was, 2004, prices eventually crossed, um,
the dollar-a-megabyte barrier, or something like that.
But there's a quite different discussion of the ideas.
And that kind of matching is much harder, and that's one of
the things that other datasets have tried to do differently.
Um, another limitation is that these questions and
answers are very much, find the sentence that's addressing the fact,
match your question to the sentence,
return the right thing,
that there's nothing sort of more difficult that involves multi-sentence,
combine-facts-together styles of inferencing;
the cross-sentence stuff there is pretty much limited to
resolving coreference, which is something we'll talk about later in the class,
which means that you see a he or a she or an it,
and you can work out what that refers to earlier in the text.
Um, nevertheless, despite all those disadvantages,
it sort of proved that SQuAD was, you know,
well-targeted in terms of its level of difficulty, well-structured,
clean dataset, and it's just been
sort of everybody's favorite for a question answering dataset.
It also seems to have proved that actually, for
people who work in industry and want to build a question answering system,
starting off by training a model on SQuAD
actually turns out to work pretty well.
I mean, it's not everything you want to do.
You definitely wanna have relevant in domain data and be using that as well,
but you know, it turns out that it seems to actually be a quite useful starting point.
Okay. So, what I wanted to show you now is a concrete,
simple, neural question answering system.
Um, and this is the model that was built by Danqi Chen, and I guess she was
sort of an Abby predecessor since she was the preceding head TA for CS224N.
Um, so this system,
um, Stanford Attentive Reader it kind of gets called now.
I mean, this is sort of essentially
the simplest neural question answering system that works pretty well.
So, it's not a bad thing to have in mind as
a baseline and it's not the current state of the art by any means.
But you know, if you're sort of wondering what's the simplest thing that I can build
that basically works as a question answering system decently,
this is basically it.
Um, okay. So how does this work?
So the way it works is like this.
So, first of all,
we have a question which team won Super Bowl 50?
And what we're gonna wanna do is build a representation of a question as a vector.
And the way we can do that is like this,
for each word in the question,
we look up a word embedding.
So, in particular it used GloVe 300-dimensional word embeddings.
Um, we then run an LSTM
forward through the question and then kind of like Abby talked about,
we actually make it a bi-LSTM.
So, we run a second LSTM backwards through the question.
And so then, we grab the end state of both LSTMs
and we simply concatenate them together into a vector of dimension 2d,
if our hidden states of the LSTM are dimension
d, and we say that is the representation of the question.
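Here's a minimal PyTorch sketch of that question encoder, assuming GloVe-initialized embeddings and a single-layer bi-LSTM; the class and variable names are just illustrative, and details like dropout are left out.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # initialized from GloVe in practice
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, question_ids):                # (batch, q_len) word indices
        x = self.embed(question_ids)                # (batch, q_len, 300)
        outputs, (h_n, _) = self.bilstm(x)          # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward state and the final backward state: (batch, 2*hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)
```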
Okay. So, once we have that,
we then start looking at the passage.
And so, for the start of dealing with the passage,
we do the same thing.
We, um, look up a word vector for every word in
the passage and we run a bidirectional LSTM,
now being represented a bit more compactly um, across the passage.
But then we have to do a little bit more work because we actually
have to find the answer in the passage.
And so what we're gonna do is use
the question representation to sort of work out where the answer is using attention.
So this is a different use of attention to machine translation.
That kind of attention equations are still exactly the same.
But we've now got this sort of one question vector that we gonna be trying to
match against to return the answer.
So, what we do is we, um,
work out an attention score between
each word's bi-LSTM representation and the question.
And so the way that's being done is we're using this bi-linear attention,
um, that um, Abby briefly discussed and we'll see more of today.
We've got the question vector,
and the vector for a particular position in the passage,
i.e., the two concatenated LSTM hidden states.
So they're the same dimensionality.
We have this intervening learned W matrix.
So, we work out that quantity,
um, for each position,
and then we put that through a softmax which will give us
probabilities over the different words in the passage.
Um, and those give us,
um, our attention weights.
And so at that point we have attention weights,
um, for different positions, um,
in the passage and we just declare that,
um, that is where,
um, the answer starts.
Um, and then to get the end of the answer,
we simply do exactly the same thing again apart from we train a different W matrix here,
and we have that,
um, predict the end token.
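And here's a rough sketch of that bilinear start/end attention, assuming we already have the question vector and the passage bi-LSTM states; the names are illustrative, and the point is the two separately learned W matrices, one for starts and one for ends.

```python
import torch
import torch.nn as nn

class BilinearSpanPredictor(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        d = 2 * hidden_dim                            # size of the concatenated forward+backward states
        self.W_start = nn.Linear(d, d, bias=False)    # learned bilinear matrix for span starts
        self.W_end = nn.Linear(d, d, bias=False)      # a different learned matrix for span ends

    def forward(self, q, passage_states):
        # q: (batch, 2h) question vector; passage_states: (batch, p_len, 2h)
        start_scores = torch.bmm(passage_states, self.W_start(q).unsqueeze(2)).squeeze(2)
        end_scores = torch.bmm(passage_states, self.W_end(q).unsqueeze(2)).squeeze(2)
        # softmax over passage positions gives distributions over start and end positions
        return start_scores.log_softmax(dim=1), end_scores.log_softmax(dim=1)
```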
And there's something a little bit subtle here.
Um, because, you know, really we're asking it to sort
of predict the starts and the ends of the answer,
and you might think, but wait a minute.
Surely, we need to look at the middle of the answer as well because maybe the,
the most indicative words are actually going to be in the middle of the answer.
Um, but, you know, really what we're
sort of implicitly telling the model is, well,
when you're training, if there's stuff in the middle that's useful,
it's the bi-LSTM's job to push it to the extremes of the span,
so that this simple bi-linear attention
will be able to get a big score at the start of the span.
And you might also think there's something
funny that this equation and that equation are exactly the same.
So, how come one of them is meant to know it's picking out the beginning, um,
and the other the end?
And again, you know, we're not doing anything to impose that.
We're just saying, neural network.
It is your job to learn.
Um, you have to learn a matrix here and a different one over there,
so that one of them will pick out parts of the representation that
indicate starts of answer spans and the other one ends of answer spans.
And so, that will then again pressure
the neural network to sort of self organize itself in
such a way that there'll be some parts of
this hidden representation that will be good at learning starts of spans.
You know, maybe they'll be carried backwards by
the backwards LSTM, and some parts of it will be good at
learning where the spans end, and then
the W matrix will be able to pick out those parts of the representation.
Um, but yeah, uh,
that's the system. Um, yeah.
So, um, so this is
the basic Stanford Attentive Reader model and it's just no more complex than that.
Um, and the interesting thing is, you know,
that very simple model actually works pretty well.
Um, so this is going back in time.
Again, this was the February 2017 SQuAD version 1 leaderboard.
Um, but at that time- like,
always in neural networks, quite a bit of your success
is tuning your hyperparameters and optimizing your model really well.
And, you know,
it's been repeatedly proven in neural network land that often you can get
much better scores than you would think from
very simple models if you optimize them really well.
So there have been multiple cycles in sort of
deep learning research where there
was a paper that did something and then the next person says,
"Here's a more- more- more complex model that
works better," and then someone else published a paper saying,
"Here's an even more complex than that model that works
better," and then someone points out, "No.
If you go back to the first model and just really tune its hyperparameters well,
you can beat both of those two models."
And that was effectively the case about what
was happening with the Stanford Attentive Reader.
That, you know, back in February 2017,
if you just train this model really well,
it could actually outperform most of the early SQuAD systems.
I mean, in particular,
it could outperform, um, the BiDAF,
the version of BiDAF that was around in early 2017 and,
you know, various of these other systems from other people.
But it was actually, at that time,
it was pretty close to the best system that anyone had built.
Um, as I've already pointed out to you,
um, the numbers have gone up a lot since then.
So I'm not claiming that, um,
this system is still as good as the best systems that you can build. But there you go.
Um, so that's the simple system that already works pretty well,
but of course you want this system to work better.
Um, and so Danqi did quite a bit of work on that.
And so here I'll just mention a few things for, um,
Stanford Attentive Reader++ as to
what kind of things can you do to make the model better.
And so here's a sort of a picture of, um,
the sort of the improved system and we'll go through
some of the differences and what makes it better.
Um, there's something I didn't have before that I should just mention, right?
Sort of this whole model, all the parameters of this model are just trained end to end,
where your training objective is simply, um,
working out how accurately you're predicting
the start position and how accurately you're predicting
the end position so that the attention gives
you a probability distribution over start positions and end positions.
So you're just being asked what probability estimate
are you giving to the true start position and the true end position.
And to the extent that, though,
you know, those aren't one,
you've then got a loss that is being sort of summed in terms of negative log probability.
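As a sketch, the loss for one example is just the negative log probability given to the true start position plus the negative log probability given to the true end position; the names here are illustrative, assuming the predictor returns log-distributions over passage positions as in the sketch above.

```python
import torch.nn.functional as F

def span_loss(start_log_probs, end_log_probs, true_starts, true_ends):
    """Cross-entropy over start positions plus cross-entropy over end positions.
    start_log_probs, end_log_probs: (batch, passage_len) log-distributions.
    true_starts, true_ends: (batch,) gold token indices.
    nll_loss averages over the batch by default."""
    return F.nll_loss(start_log_probs, true_starts) + F.nll_loss(end_log_probs, true_ends)
```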
Okay. So how is this model, um,
more complex now than what I showed before?
Essentially in two main ways.
So the first one is looking at the question,
we still run the BiLSTM as before.
Um, but now, what we're going to do- it's a little bit crude
just to take the end states of the LSTM and concatenate them together.
It turns out that you can do better by making use of all states in an LSTM.
And this is true for most tasks where you
want some kind of sentence representation from a sequence model.
It turns out you can generally gain by using
all of them rather than just the endpoints.
Um, but this is just an interesting general thing to know again because, you know,
this is actually another variant of how you can use attention.
You know, a lot of the last two years of neural NLP can be summed
up as people having found a lot of clever ways to use
attention, and that's been powering just about all the advances.
Um, so what we wanna do is we want to have attention over the positions in this LSTM.
But, you know, this- we're processing the query first.
So it sort of seems like we've got nothing to calculate attention with respect to.
So what we do is we just invent something.
So we just sort of invent.
Here is a vector, and it's sometimes called a sentinel or some word like that,
but, you know, we just in our PyTorch say,
"Here is a vector.
Um, we initialize it randomly,
and we're gonna calculate attention with respect to that vector,
and we're going to use those attention scores, um, to, um,
work out where to pay attention, um,
in this BiLSTM," and then we just sort of train that vector so it gets values.
And so then we end up with a weighted sum of the time
steps of that LSTM that, uh, then forms the question representation.
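Here's a rough sketch of that learned-vector attention over all the question bi-LSTM states; calling the learned vector `w` is just a naming choice made here for illustration.

```python
import torch
import torch.nn as nn

class WeightedQuestionEncoding(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        # A randomly initialized vector, trained along with everything else;
        # attention over the question positions is computed with respect to it.
        self.w = nn.Parameter(torch.randn(state_dim))

    def forward(self, question_states):              # (batch, q_len, state_dim)
        scores = question_states.matmul(self.w)      # (batch, q_len)
        alphas = scores.softmax(dim=1)               # attention weights over question positions
        # Weighted sum of all BiLSTM time steps, instead of just the end states
        return (alphas.unsqueeze(2) * question_states).sum(dim=1)   # (batch, state_dim)
```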
Um, second change, uh,
the pictures only show a shallow BiLSTM but, you know,
it turns out you can do better if you have a deep BiLSTM and say
use a three-layer deep BiLSTM rather than a single layer.
Okay. Then the other changes in
the passage representations and this part arguably gets a little bit more hacky,
um, but there are things that you can do that make the numbers go up, I guess.
Um, okay.
So firstly, for the representation of words, rather than only using
the GloVe representation, the input vectors are
expanded: a named entity recognizer and a part-of-speech tagger are run,
and since those are sort of small sets of values,
the output of those is just one-hot encoded and concatenated onto
the word vector, so it represents whether it's
a location or a person name and whether it's a noun or a verb.
Um, word frequency proves to be a bit useful.
So you're concatenating on sort of a representation of the word frequency as,
um, just sort of a float of the unigram probability.
Um, and then this part is kind of key to getting some further advances which is, well,
it turns out that we can do a better job by doing some sort
of better understanding of the matching between the question and the passage.
And, um, this feature seems like it's
very simple but turns out to actually give you quite a lot of value.
So you're simply saying for each word in the question,
uh, so for each word- well,  I said that wrong.
For each word in the passage,
you were just saying, "Does this word appear in the question?"
And if so you're setting a one bit into
the input and that's done in three different ways: exact match,
uncased match, and lemma match.
So that means something like drive and driving, um,
will match, and just that sort of
indicator of here's where in the passage that's in the question.
In theory, the system should be able to work that out
anyway, but indicating it explicitly gives quite a bit of value.
And then this last one does a sort of a softer version of that, where it's using word
embedding similarities to sort of calculate
a kind of similarity between question and passage words,
and that's a slightly complex equation that you can look up.
But effectively, um, you're getting the embeddings of the words in the question and the passage.
Each of those, you're running through a single hidden layer,
neural network, you know,
dot producting it, and then putting all that through a Softmax,
and that kind of gives you a sort of word similarity score and that helps as well.
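As a sketch of the three-way match features just described (exact, uncased, and lemma match); the `lemma` function is assumed to come from some lemmatizer such as spaCy or NLTK, so treat that part as an assumption rather than a fixed API.

```python
def match_features(passage_tokens, question_tokens, lemma):
    """For each passage word, three indicator features saying whether it appears
    in the question: exact match, uncased (lowercase) match, and lemma match."""
    q_exact = set(question_tokens)
    q_lower = {w.lower() for w in question_tokens}
    q_lemma = {lemma(w) for w in question_tokens}
    features = []
    for w in passage_tokens:
        features.append([
            float(w in q_exact),           # exact match
            float(w.lower() in q_lower),   # uncased match
            float(lemma(w) in q_lemma),    # lemma match: "driving" matches "drive"
        ])
    # One 3-dimensional feature vector per passage word, concatenated onto its embedding.
    return features
```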
Okay. So here's the kind of just overall picture this gives you.
So if you remember, um, um,
there was the sort of classical NLP
with logistic regression baseline; that was around 51.
So for sort of a fairly simple model,
like the Stanford Attentive Reader,
it gives you an enormous boost in performance, right?
That's giving you close to 30 percent performance gain.
And then, you know, from there,
people have kept on pushing up neural systems.
But, you know, so this gives you, kind of, in some sense, three quarters of
the value over the traditional NLP system of the much more,
um, complex, um, neural systems that come after it.
Um, yeah.
In terms of error reduction,
they're huge but it's sort of more like they're giving you the sort of,
um, 12 percent after that.
Why did these systems work such a ton better um, than traditional systems?
And so we actually did some error analysis of this and, you know,
it turns out that most of their gains is because they can just do
better semantic matching of word similarities
or rephrasings that are semantically related but don't use the same words.
So, to- to the extent that the question is where was Christopher Manning born?
And the sentence says Christopher Manning was born in Australia,
a traditional NLP system would get that right too.
But to the extent that being able to get it right
depends on being able to match,
sort of, looser semantic matches, so that we understand that, sort of, um,
you know, "place of birth" has to be matching "was born" or something.
That's where the neural systems actually do work much much better.
Okay. So, that's not the end of the story on question-answering systems.
And I wanted to say just a little bit about um,
more complex systems to give you some idea um,
of what goes on after that.
Um, but before I go further into that,
are there any questions on uh,
up until now, Stanford Attentive Reader?
[NOISE] Yeah.
I have a question about attention in general.
Every example we've seen has been just linear mapping with a weight matrix.
Has anybody tried to convert that to a deep neural network and see what happens?
Um, so yes they have.
Well, at least a shallow neural network.
Um, I'll actually show an example of that in just a minute.
So maybe I will um, save it till then.
But yeah absolutely, um,
yeah people have done that and that can be a good thing to um, play with.
Anything else? Okay. Um, okay.
So, this is a picture of the BiDAF system,
so this is the one from AI2 UDub.
And the BiDAF system is very well known.
Um, it's another sort of classic version of
question-answering system that lots of people have used and built off.
Um, and, you know,
some of it isn't completely different to what we saw before but it has various additions.
So, there are word embeddings just like we had before,
there's a biLSTM running just like what we had before,
and that's being done for both the um,
passage and the question.
Um, but there are some different things that are happening as well.
So one of them is rather than just having word embeddings,
it also processes the questions and passages at the character level.
And that's something that we're going to talk about coming up ahead in the class.
There's been a lot of work on character-level processing in recent neural NLP,
but I don't want to talk about that now.
Um, the main technical innovation of the BiDAF model
is this attention flow layer, because that's in its name: bidirectional attention flow.
And so, there was a model of attention flow where you have attention
flowing in both directions between the query and the passage.
And that was their main innovation and it was quite useful in their model.
Um, but beyond that,
there's you know, sort of more stuff to this model.
So after the attention flow layer there are again
multiple layers of bidirectional LSTMs running.
And then on top of that their output layer is more
complex than the sort of simple attention version that I showed previously.
So let's just look at that in a bit more detail.
Um so, for the attention flow layer.
So, the motivation here was in the Stanford Attentive Reader,
we used attention to map from
the representation of the question onto the words of the passage.
But, you know, that was the question as a whole mapping onto the words of the passage.
Where their idea was well,
presumably you could do better by mapping in both directions at the word level.
So you should be sort of finding passage words that you can map onto question words,
and question words that you can map onto passage words.
And if you do that in both directions with attention flowing,
and then run another round of sequence models on top of that,
that you'll just be able to do much better matching between the two of them.
And so the way they do that is, um,
at the bottom layers they've run these two LSTMs,
so they have an LSTM representation for each question word
and each passage position.
And at this point I have to put in a slight apology, because I just
stole the equations, and so the letters that are used change.
Sorry. But these are the
individual question words, and these are the individual passage words.
And so, what they're then wanting to do is to say for each passage word,
and each question word, I want to work out a similarity score.
And the way they work out that similarity score is they build a big concatenated vector.
So there's the LSTM representation of the passage word, the question word,
and then they throw in a third thing where they do a Hadamard product,
so an element-wise product of the question word and the context word.
Um, you know, for a neural net purist, throwing in
these kinds of Hadamard products is a little bit of a cheat, because
you would kind of hope that a neural net might just learn that
this relation between the passage and the question was useful to look at.
But you can find a lot of models that put in
these kinds of Hadamard products, because it's
a very easy way of having a model that knows that matching is a good idea.
Because essentially, for each question and passage word pair,
this is asking: do the vectors look similar in various dimensions?
And you can assess that very well by looking at the Hadamard product.
So you take that big vector,
and you then dot-product it with a learned weight vector,
and that gives you a similarity score
between each position in the question and the context.
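A minimal PyTorch sketch of that similarity computation might look like this; the class and variable names and dimensions are mine, not taken from the BiDAF code.

```python
import torch
import torch.nn as nn

class BiDAFSimilarity(nn.Module):
    """Sketch of the BiDAF similarity score: for each (context i, question j) pair,
    concatenate [c_i; q_j; c_i * q_j] and dot it with a learned weight vector
    to get one scalar S_ij."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.w = nn.Linear(3 * hidden_dim, 1, bias=False)  # learned weight vector

    def forward(self, c, q):
        # c: (n, d) context/passage states, q: (m, d) question states
        n, m = c.size(0), q.size(0)
        c_exp = c.unsqueeze(1).expand(n, m, -1)                 # (n, m, d)
        q_exp = q.unsqueeze(0).expand(n, m, -1)                 # (n, m, d)
        cat = torch.cat([c_exp, q_exp, c_exp * q_exp], dim=2)   # (n, m, 3d)
        return self.w(cat).squeeze(2)                           # (n, m) similarity matrix S
```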
And so then what you're gonna do is use that to
define attentions that go in both directions. Um-
So for the, um, context-to-question attention,
this one's completely straightforward.
So, you put these similarity scores through a softmax.
For each of the i positions in the passage,
you have a softmax which is giving you a probability distribution
over question words, and then you come up with
a new representation of the i-th position, which is the attention-weighted,
um, version, the attention-weighted average of those question words.
Um, so you're
having an attention-weighted view of the question mapped onto each position in the passage.
Um, you then want to do something in the reverse direction.
Um, but the one in the reverse direction is done subtly differently.
So you're again starting off, um,
with the same similarity scores, but this time they want to
work out which position in the question is the one that's
aligning the most, so they take a max:
for each of the i positions in the passage,
they find the score of the most aligned question word.
And then they do a softmax over those scores, one per passage position,
and those weights are used to form a new representation of the passage by,
sort of, summing over the passage states with these attention weights.
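Given the similarity matrix S from the sketch above, the two attention directions could be written roughly like this; again, this is an illustration under my naming assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def bidirectional_attention(S, c, q):
    """Sketch of BiDAF's two attention directions, given the similarity matrix
    S (n x m), context states c (n x d), and question states q (m x d)."""
    # Context-to-question: for each passage position, softmax over question words,
    # then take the attention-weighted average of the question states.
    a = F.softmax(S, dim=1)                            # (n, m)
    c2q = a @ q                                        # (n, d): question summary per passage word

    # Question-to-context: take the max similarity over question words at each
    # passage position, softmax those n scores, and form one weighted sum of the
    # context states, broadcast back to every position.
    b = F.softmax(S.max(dim=1).values, dim=0)          # (n,)
    q2c = (b.unsqueeze(0) @ c).expand(c.size(0), -1)   # (n, d)
    return c2q, q2c
```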
Okay. So you build these things up, and this then
gives you a new representation where you have,
um, your original representations of the passage words,
the new representations that you've built from
this bidirectional attention flow, and
the Hadamard products of them, and
that then gives you the output of the BiDAF layer, and that output of
the BiDAF layer is then what's being fed as
the input into the next sequence of LSTM layers.
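Putting those pieces together, the per-position output of the attention flow layer is roughly the following concatenation; this is a sketch following the form described above, with my own names.

```python
import torch

def bidaf_output(c, c2q, q2c):
    """Sketch of the BiDAF layer output: original context states, the attended
    vectors, and their Hadamard products, concatenated per position."""
    # c, c2q, q2c: (n, d) each; output: (n, 4d), fed into the next stack of BiLSTMs
    return torch.cat([c, c2q, c * c2q, c * q2c], dim=1)
```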
Okay. Um, and so yeah,
um, so then that's the modeling layer:
you have another two BiLSTM layers. And the way they do the,
um, span selection is a bit more complex as well.
Um, so they're then, um,
taking the output of the modeling layer and putting it through a sort of
a dense feed-forward neural network layer and then softmaxing over that,
um, and that's getting you a distribution over
start positions, and then you run yet another LSTM to get a distribution over end positions.
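Here's a rough, simplified sketch of that span-selection idea; the actual BiDAF output layer also concatenates in the attention-flow output before each softmax, so this only shows the shape of the computation, under my own naming.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanOutput(nn.Module):
    """Simplified sketch: dense layer + softmax over positions for the start,
    another BiLSTM pass before the end distribution. Assumes dim is even."""
    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end_lstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.end = nn.Linear(dim, 1)

    def forward(self, m):
        # m: (1, n, dim) output of the modeling layer
        p_start = F.softmax(self.start(m).squeeze(-1), dim=1)  # (1, n) start distribution
        m2, _ = self.end_lstm(m)                               # another BiLSTM pass
        p_end = F.softmax(self.end(m2).squeeze(-1), dim=1)     # (1, n) end distribution
        return p_start, p_end
```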
Um, yeah. So, that gives you some idea of a more complex model.
Um, you know, in some sense,
um, the summary, if you go further forward from here, is that for
most of the work in the last couple of years,
people have been producing progressively more complex architectures with
lots of variants of attention, and effectively that has been giving good gains.
Um, I think I'll skip showing you that one, since time is running out.
But, um, let me just mention this FusionNet model,
which was done by people at Microsoft, because this relates to the answer
to the attention question from before, right?
So people have definitely used different versions of attention, right?
In some of the stuff that we've shown, we tend to emphasize
this bilinear attention, where you've got two vectors mediated by a matrix.
And I guess traditionally at Stanford NLP,
we've liked this, um,
version of attention, since it seems to very directly learn
a similarity, but other people have used a little neural net.
So this is a shallow neural net to
work out attention scores, and there's
no reason why you couldn't say, maybe it would be even better if I
make that a deep neural net and add another layer.
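To make the contrast concrete, here are minimal sketches of the two scoring functions being discussed: the bilinear form and a shallow neural net form. The hidden size and the tanh are my assumptions.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """Bilinear attention score: two vectors mediated by a learned matrix W."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, q, p):       # q: (d,), p: (d,)
        return q @ self.W(p)       # scalar score q^T W p

class MLPAttention(nn.Module):
    """Shallow-neural-net attention score: concatenate, one hidden layer, scalar out.
    Adding more layers would make this the 'deep' variant mentioned above."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, q, p):
        return self.net(torch.cat([q, p])).squeeze(-1)  # scalar score
```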
Um, and, you know,
to be perfectly honest, um,
some of the results that have been produced by people, including at Google,
argue that actually that MLP version of attention is better.
Um, so there's something to explore in that direction.
But actually, um, the people in FusionNet didn't head that direction because they said,
"Look, we want to use tons and tons of attention.
So we want an attention computation that's pretty
efficient and so it's bad news if you have to
be evaluating a little dense neural net at
every position every time that you do attention."
So this bilinear form is fairly appealing,
but they then did some playing with it: rather than having a full W matrix,
you can reduce the rank and complexity of
your W matrix by dividing it into the product of two lower-rank matrices.
So you can have a U and a V matrix.
And if you make these rectangular matrices that are kind of skinny,
you can then have a sort of a lower rank factorization and,
that seems a good idea.
And then they thought well,
maybe really you want your attention distribution to be symmetric.
So we can actually put in the middle here,
we can have the U and the V, so to speak,
be the same and just have a diagonal matrix in
the middle and that might be a useful way to think of it.
And that all makes sense in linear algebra terms, but then they thought,
"Oh, non-linearity is really good in deep learning.
So why don't we
stick the left and right halves through a ReLU, and maybe that will help."
[LAUGHTER] Which doesn't make so much sense in linear algebra terms, um,
but that's actually what they ended up using as their, um, attention form.
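Here's a small sketch of that factorized, symmetric attention score with the ReLUs thrown in; the rank k and the names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedSymmetricAttention(nn.Module):
    """Sketch of the FusionNet-style score: replace a full d x d matrix W with
    U D U^T, where U is a skinny d x k matrix and D is diagonal, and pass each
    projected side through a ReLU before taking the dot products."""
    def __init__(self, dim, k=64):
        super().__init__()
        self.U = nn.Linear(dim, k, bias=False)  # low-rank projection
        self.d = nn.Parameter(torch.ones(k))    # diagonal of D

    def forward(self, a, b):
        # a: (n, dim), b: (m, dim); returns an (n, m) matrix of attention scores
        left = F.relu(self.U(a))                # (n, k)
        right = F.relu(self.U(b))               # (m, k)
        return (left * self.d) @ right.t()      # ReLU(aU) D ReLU(bU)^T
```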
There are lots of things you can play with when doing your final project.
Um, yeah.
But, you know, their argument is still
that doing attention this way is actually much, much
cheaper, and so they can use a lot of attention.
And so they build this very complex tons of attention model, um,
which I'm not going to try and explain, um,
all of now, um,
but I will show you this picture.
Um, so a point that they make is that a lot of
the different models that people have explored in different years
are, you know, sort of
doing different kinds of attention.
You could be doing attention right
at the level of the original LSTM,
you could run both sides through some stuff and then do attention,
you can do self-attention inside your layer; there are a lot of
different attentions that different models have explored.
And essentially what they're wanting to say is,
let's do all of those, and let's make it deep and do it all
five times, and the numbers will go up. And to some extent the answer is,
yeah, they do, and the model ends up scoring very well.
Okay, um, so the one last thing I just wanted to mention but not explain is,
I mean in the last year there's then been
a further revolution in how well people can do these tasks.
And so people have developed algorithms which produce contextual word representations.
So that means that rather than a traditional word vector,
you have a representation for each word in a particular context:
here's the word frog in this particular context. And the way people build
those representations is using something
like a language modeling task, like Abby talked about,
putting probabilities on words in
context to learn a context-specific word representation.
And ELMo was the first well-known such model.
And then people from Google came up with BERT,
which worked even better.
Um, and so BERT is really, in some sense, a
super complex attention architecture trained with a language-modeling-like objective.
We're going to talk about these later, um,
I'm not going to talk about them now, um,
but if you look at the current SQuAD 2.0 leaderboard,
um, you will quickly,
um - sorry, oh, I put up the wrong slide; that was the bottom of the leaderboard.
Oops, slipped at the last minute.
If you go back to my slide which had the top of the leaderboard, um,
you will have noticed that at the top of the leaderboard,
every single one of the top systems uses BERT.
So that's something that you may want to
consider, but you may also want to consider how you could
use it as a sub-module that you could add other stuff to, as many of these systems do.
Okay. Done for today.
