[MUSIC]
Stanford University.
>> All right, hello everybody and welcome
to what I personally think is one of
the most fun lectures. Now we can
really just assume we have all these
basic Lego blocks and
we can build some awesome stuff.
Namely, dynamic neural networks for
question answering.
In terms of organization,
there's not too much going on today.
Except, hopefully, all of you are working
hard on PA4 and on your projects.
I'll have my project advice office
hour today right after class.
QueueStatus 90.
I'll be on hangout and everything.
Any questions around the projects?
Where you want to be?
Nope?
All right, so let's get started.
Now why is PA4 so open ended and why do we
think it's so exciting as a problem set?
Basically, the question
that I wanna ask today, and
maybe you can also ask yourself
during your projects and
your problem sets is whether
we may be able to cast all NLP
tasks actually as a question
answering problem.
And this question is what led us to
invent the dynamic memory network,
which we'll talk about today.
So what do I mean by this?
Let's go through a couple of examples of
what question answering might look like.
The first one here is sort of the standard
question answering type of problem,
where we have some inputs,
we have a question, and
we need to do some logical reasoning
maybe to answer that question.
So for instance, here the inputs
are Mary walked to the bathroom,
Sandra went to the garden,
Daniel went back to the garden.
Sandra took the milk there and
now we ask where is the milk.
Now, in order to do this, we'd have to
actually do a little bit of reasoning
because if you just try to
retrieve the sentence or
the last sentence that mentions milk.
Well then, it will tell you
Sandra took the milk "there" and
you don't know where Sandra is and
hence what "there" refers to.
So now I have to do
some coreference resolution.
If we were to try to build sort of a hand
tuned old school machine learning NLP
system where we put our human knowledge
into the task to answer that question.
And if we were to do that then we'd
realize, all right let's try to find out
where Sandra is and then we look at
this sentence and see that she is
in the garden last and then we could
answer the question correctly.
Now, by the end of this class, you will
know a model that you don't have to give
any of that kind of information to.
It will actually just learn all of
this from examples of this kind.
Now that's a standard Q&A problem, but
you can also look at sentiment
where we might have an input here.
Everybody is happy.
And again we have the question,
what is the sentiment, and the answer
at the end is just the label,
essentially the word pertaining
to that label, namely positive.
Now we can go further.
We had the task of sequence tagging,
namely named entity recognition and
part-of-speech tagging.
And we can also ask what are the named
entities and then we might want to
obtain either a list that could include
a lot of Os for not the named entity.
Or just a list of the actual
words that are named entities.
And again, it's an input-question-answer
triplet that we would need for
training that kinda model.
Same with what are the part
of speech tags.
Every word has a part of speech tag, so,
that's a sequence of the same
length as the input.
And we can even go as far as, and
this is starting to be a question
of how useful it is,
casting even machine translation
as a question answering problem, right?
In the end, almost every NLP problem
has some input, some question
about it, and
some output, some answer.
So now with that in mind,
wouldn't it be amazing if we were able
to build a single joint model for
general question answering
that can basically learn from any
input-question-answer triplet dataset?
Now, towards a single joint model for
this kind of general QA, there are
two major obstacles.
So let's talk about what they are.
The first one is that we don't even have a
single architecture, or up until last year
when we published a paper that solves all
these, there was no single architecture
that consistently got state-of-the-art
across these different tasks.
So for question answering we have
strongly supervised memory networks.
For sentiment analysis,
we actually used tree-structured LSTMs,
similar to the recursive neural networks
that Chris talked about two lectures ago.
But using the ideas of having various
gates as you combine words and
phrases into phrase
vectors of longer length.
And for part-of-speech tagging it used to
be a bidirectional LSTM conditional random
field, another type of model family that we
didn't go into in any detail in this class,
sort of in the graphical models world.
What you do notice is that all the state
of the art models have some kind of neural
network in them somewhere these days.
That's one of the reasons we
merged from 224D into 224N.
It felt like you really need to know
these basic building blocks.
But there are still different kinds
of models that make different kinds
of assumptions.
And we would call these different
kinds of architectures, right?
So the architectures
that we've talked about so
far are sort of bag-of-words,
window-based models,
convolutional neural networks, recurrent
neural networks, LSTMs, and so on.
So that's the first major obstacle.
Now the second one
is that fully joint multitask
learning is actually incredibly hard.
And what I mean by saying that, is that
we don't just wanna share some parts
of the model like word vectors, which
we're already pretty good at sharing.
But we wanna have the exact
same decoder or classifier.
And we don't wanna just transfer between
a single source and one target task but
we wanna ideally train
both of them jointly.
So in computer vision I'll encourage you
all to take 231 next quarter I think
when it's getting offered on convolutional
neural networks for computer vision.
Computer vision is actually
better able to share more of
the layers as you go up the stack.
Whereas in NLP, so far, when you try to
do multitask learning and share weights,
what we've been mostly able to do so
far is to share the word vectors.
And then we train the word vectors and
then we initialize some other model
with those pre-trained word vectors.
Nobody yet consistently got improvements,
though it's an active area of research,
to share deeper layers of the LSTM,
for instance.
Say we trained the system
on machine translation, and
then just changed the last layer, the
softmax, to do sentiment analysis,
or something like that.
There hasn't been any paper yet
showing improvements for that.
And even worse,
it's hard to publish negative results and
you'll only ever read
about successful transfer
learning cases because the unsuccessful
ones don't get published.
But when you actually do research in
this area you'll notice that as soon as
the tasks are not directly related,
they actually tend to hurt each other.
So if you're trying to train two
tasks together in one model,
say you just have two softmaxes on
the same hidden state of your LSTM,
performance actually tends to get
worse in many cases, too.
So those are
the two major obstacles.
Now, the dynamic memory networks
that I'll talk to you about today
only tackle the first obstacle.
As in, they're an architecture which
still has hyperparameters
that might differ for different tasks, but
it's at least the same general modules
that you have in that architecture.
And basically it can tackle all these
different tasks that I described to you,
at least in some capacity.
And several of them actually
at the state of the art level
at the time of publication.
Any questions around these obstacles and
sharing and
the idea of multitask learning?
So basically, we're thinking of
multitask learning through the lens
of seeing everything as a question
over some kind of input
that has some kind of answer.
So let's look at the high level idea for
answering these tougher
kinds of questions.
So imagine you had this story each of
the facts is relatively simple and
straightforward.
John went to the hallway and
things like that.
But imagine I now asked you,
after you read it, where's Sandra?
And you might have to actually try to
retrieve the episode in which that fact,
or the answer to that question
was actually mentioned.
And so, in some cases, as we saw with
the football, you may actually have to go
multiple times over the input to
answer that question correctly.
This is kind of what led us to this
idea of Dynamic Memory Network.
This is the kind of architecture you
would see a lot in papers, and so
we're going to walk through it
first on a high level intuitively,
and then we'll zoom into the different
areas to gain a better understanding.
Now, the first thing you will notice is,
we have different modules.
This is what I call sort
of model components.
And the reason we're basically
separating them out is that, I think,
eventually, you have to do deep learning
research, and also general engineering,
with the help of good
software engineering practices.
Where you have different modules,
they define interfaces and
you might be able to switch out one module
with another one, but that doesn't mean
you have to change all the other
modules in your larger architecture.
So that's generally a good
sort of modeling framework.
Now, what does this
dynamic memory network do?
So it starts with having simple word
vectors like GloVe or word2vec.
And basically we'll have a recurrent
neural sequence model such as
a GRU that goes over the input and
just computes the hidden state at
every word and at every sentence.
So a standard GRU that we have
defined very carefully before.
Now, that's just independent
of the question.
It will just basically compute
a hidden state for every word.
Now, it will have also a GRU for
the question in the question module.
In fact, sometimes you can also share
the weights between the question and
the input, here, these GRUs.
And we'll basically use a GRU
to compute a question vector.
That question vector Q is just
going to be the last hidden state
of the GRU after it went for
every word of the question.
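As a rough sketch of that question module, here is a minimal numpy GRU; the weight names and dimensions are illustrative assumptions, not the lecture's actual code. The question vector q is just the last hidden state after reading all question words:

```python
# Illustrative sketch (not the lecture's code): a question module that runs
# a GRU over the question's word vectors and keeps only the final hidden
# state as the question vector q. Weight names/shapes are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step; W, U, b each hold update ('z'), reset ('r'), candidate ('h') params."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])   # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])   # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])
    return (1 - z) * h + z * h_tilde

def question_vector(word_vectors, W, U, b, hidden_dim):
    """q is simply the GRU's hidden state after reading the last question word."""
    h = np.zeros(hidden_dim)
    for v in word_vectors:
        h = gru_step(v, h, W, U, b)
    return h
```

The input module would run the same kind of GRU over the input words, keeping the hidden state at the end of each sentence.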
Now, and this is where
the interesting stuff starts.
We'll use these attention mechanisms,
that you've learned about
before in a very specific way.
We will define it soon, but we'll
have this attention mechanism that is
essentially triggered by
the question to go over the inputs.
So here the question is,
where's the football?
And now basically we would assume that
the fact that football was mentioned
in this question is stored somewhere
inside the hidden state of.
That last time step GRU and
we call that Q.
And we use Q to essentially trigger an
attention mechanism over all the potential
inputs.
And whenever there is a strong attention
being paid to a specific sentence,
we're going to give that sentence
as input to yet another GRU.
And that is in the episodic memory module.
Now, whenever we have
a line like this here,
we basically assume that is a some kind of
recurrent neural network sequence model.
So basically, a question triggers
an attention mechanism that
goes over all the hidden
states of all my inputs, and
now basically says this fact seems
relevant for this question at hand.
So for instance, where is the football?
Well, the last sentence that
mentions football seems relevant.
And the hidden state of
this GRU captures that
there is something mentioned
about a football and
the hidden state of this GRU captures
that there's something about football.
So we store this, and
now this GRU agglomerates
only the facts that are pertinent or
relevant to the question at hand.
So it's essentially a filtering
GRU that tries to only keep track of
what's relevant for the current question.
And now, we'll define this memory state
here as the last hidden state of that GRU.
Now, for the next iteration, because
again here, John put down the football.
We don't know where John is, so
we now have stored in this vector m,
that in order to answer the current
question it seems pertinent, and
again the model learns all of
these things of course by itself.
We don't manually tell it these are
objects, these are people, or colors, or
anything like that, but we basically now
store in this vector M that John and
football seem relevant to the question
at hand, namely, where's the football?
And so, as we go over the input again,
we'll take M and
Q into consideration in order
to answer this question, and
basically pay attention to now every fact
that mentions both John, or the football.
And so, in this case here, and
these are realistic numbers,
you basically pay a lot of attention
to "John moved to the bedroom" and
"John went to the hallway", and that is
the last sentence that seems relevant.
And so, the attention scores for
a subsequent and
previous sentences here are very low.
We store this, and
now m is given as input, at
the zeroth time step, to another GRU
that then just has
our standard softmax classifier at
every hidden state to produce an answer.
So that was a lot to digest;
it was probably the most complex model
we've looked at so far, but
all the components of this model
we've already discussed, and this is
kind of where our research is right now.
There are lot of folks who are trying
to find new kinds of architectures for
different kinds of problems.
In this particular case, we're trying to
find a general architecture that should be
possible or should be usable across
a whole host of different kinds of tasks.
So we'll zoom in in a second into
these different modules, but are there
any questions about the general idea
of the model, the general architecture?
Great.
So what are the two different tracks of
the episodic memory module?
Essentially, these tracks mirror
the input perfectly.
So there are always as many time steps in
the episodic memory module here as there
are time steps in the input.
But the model can either
decide with a classifier, or
just go over the input
a fixed number of times,
somewhere between two and five.
And every time it goes over it,
it tries to basically pay
attention to different kinds
of sentences in the input.
Depending on what the question is and what
it has so far agglomerated in terms of
relevant facts from the previous time
step, or previous episode in this case.
So here, again, we go over it,
the input, the first time.
Store things and
facts about John and the football.
And then, the second time, we'll now
pay attention to John facts, too.
And so, intuitively again, the first pass,
I ask where's the football?
And at this stage here, "John moved to
the bedroom" just doesn't seem relevant
to answering the question.
Cuz John moving to the bedroom has
nothing to do with the football.
So you have to do transitive reasoning,
as in A to B, B to C, and C to D.
Now, if you wanna go from A to D, you need
to understand the steps to get there.
So this kinda transitive
reasoning capability,
you can only get if you have
multiple passes over your info.
So the question is, if we had some
adverbs or temporal expressions and
asked sort of different kinds of
questions like was John there before or
various other kinds of questions,
like that.
And the answer is, there's
surprisingly many things it can learn,
if it has seen them in the training data.
So this kinda model will not be able
to come up with completely new
types of reasoning, but if
you show it a lot of temporal reasoning,
it will be able to answer
those kinds of questions.
So I don't see why it couldn't,
for some theoretical reason,
answer the questions of where was he
before that, or things like that.
That's right.
So m1, so it's m superscript 1,
that's right.
So the question is, in the first
pass we mostly use the question.
It turns out we'll actually copy
the question twice, and in the second pass
we'll replace the second copy of
q with m1, but that's sort of a detail.
But in general, yes, we'll only use
the information of Q for the first pass.
Once we have the second pass, we'll use
m1, and Q, to understand the attention.
And again, that way we hope that
m is agglomerating the facts
that are relevant to answer the question.
Great question.
So now,
let's zoom into the model in detail.
In some cases, it's pretty easy.
We basically just have a standard GRU, but
maybe, before I define these
sort of different modules,
in the end here we have, again, this
softmax, and so whenever you see softmax,
you can think of cross-entropy
error as your training objective.
And now, because of how
we define the modules,
we can train this entire
architecture end-to-end.
You just have to give it an input
sequence, a question, and an answer.
And now from this answer here we basically
will make errors in the beginning and
then we'll have high cross entropy error.
And then, we basically backpropagate
through all these vectors here,
through everything all the way
into the word vectors, and
we can train this whole system end to end.
And that's really where
the power of deep learning and
these kind of architectures comes in.
So the question is, should we have some
attention from the answer module also sort
of skipping the episodic memory module's
attention and going directly to the input?
And the answer is, yes, for the tasks
that we had, we did not even try that,
cuz we solved them to the state of
the art level and slightly above.
But what we actually have found is
that it makes sense, in the second
iteration on the harder dataset,
namely the Stanford question answering
dataset that you are all gonna work on,
to actually have a co-attention mechanism
where you want to re-interpret
the question also in terms of the input.
So for instance, intuitively,
if you have a question, may I cut you?
Then, the interpretation is very different
if the person asking the question
is holding a knife, or if somebody is
standing in line in a supermarket.
And so, sometimes the interpretation
of your question is actually different
depending on the input.
And so,
you want to have co-attention mechanisms.
Essentially, attention is kind of
a fun concept to play with right now.
A lot of people are sprinkling it
into a lot of different places and
a lot of different kinds of models.
And I think this would be a very fine
way, or place, to do it.
So we train everything end-to-end,
cross entropy errors, standard stuff.
Now, the input module is a standard GRU.
And again, in this particular case here,
where we have very simple sentences and
facts about each sentence, we'll actually
make the last hidden state of each
sentence explicitly accessible
when we answer the question.
Now, one improvement that we've
made on the second iteration is
instead of just having
a unidirectional GRU we actually have
a bi-directional GRU on top here and
then each sentence is represented
via the concatenation of both the left
to right and right to left direction.
Now, the question module is also
just a standard GRU.
We have word vectors,
which we call v_t here.
And basically we just get q_t, and
then we'll drop the subscript t for
the last hidden state,
just to simplify notation in
subsequent slides.
Now where it gets interesting is that
episodic memory part, and we've already
seen some attention mechanisms, and
this one is slightly different.
Basically, we'll have
the attention mechanism here,
which I'll define in the next slide, of G.
G is just a single scalar number.
Should I pay attention to this sentence
at the ith time step or not?
Now, the superscript here
refers to the iteration, or
the episode, the tth time that
we went over the entire input.
So we start here with t equals 1,
and then we go up to t equals 2.
And if we go over the input five times,
which, of course,
is also kinda slow,
we'd have here "h superscript 5." But
the main idea is essentially that we have
a global gate on top of a standard GRU.
So this global gate will basically say
this entire sentence does not matter or
does matter a lot.
And instead of having every single
hidden state have its own gate
we just turn off the entire GRU
if a fact doesn't seem relevant.
So if g_i, at the ith sentence,
and the tth episode and
pass over the entire input is 0,
what we'll do is
basically just entirely
copy that h vector forward.
No computation really necessary,
no updates to any of the units whatsoever.
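The global gate described above can be sketched in a couple of lines; this is a minimal illustration under my own naming assumptions, not the paper's code. When the scalar gate g is 0, the previous hidden state is copied forward untouched; when g is 1, a full GRU update applies:

```python
# Sketch of the global gate: h_i^t = g * GRU(s_i, h_prev) + (1 - g) * h_prev.
# The function names here are illustrative assumptions.
import numpy as np

def gated_episode_step(g, s_i, h_prev, gru_step):
    """Blend a GRU update with a pure copy-forward, weighted by scalar g in [0, 1]."""
    return g * gru_step(s_i, h_prev) + (1.0 - g) * h_prev

# With g = 0 the sentence is skipped entirely: the hidden state passes
# through unchanged, so an irrelevant fact has no effect on the episode.
toy_gru = lambda s, h: np.tanh(s + h)   # stand-in for a real GRU step
h_prev = np.array([0.5, -0.2])
skipped = gated_episode_step(0.0, np.ones(2), h_prev, toy_gru)
```

With g = 0, `skipped` equals `h_prev` exactly, which is the copy-forward behavior described above.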
So intuitively, that makes a lot of sense.
If we have sentences like Sandra
went back to the kitchen and
we're asking about the football, and
maybe eventually we'll figure out
something about John, and so on.
It's completely irrelevant to
have facts about other people
going to other places and that's really
easy to capture with this single scalar.
Now the last remaining question is well,
how do we compute that g and
the main idea here is actually
fairly simple and straightforward.
We essentially compute this vector
z here with a bunch of simple
similarities between the sentence vector,
that's again the hidden state
at the end of each sentence,
the question vector, and our memory
state from the previous iteration, and
m0 here would just be
initialized to be the question.
So the very first iterations
just have these two twice.
But that way, we can use the exact
same mechanism in the first,
second and third iteration and higher.
Now, what kinds of similarities
are we measuring here?
These are all element-wise, and
these are just Hadamard products,
so multiplicative interactions
between the sentence and
question, and between
each sentence and the memory state.
And then here, we just have elementwise
subtraction and absolute value.
So just basically,
two very simple similarity
metrics between the three
vectors that we have.
Yeah.
Is there a reason why we don't use
the inner product for similarity?
We actually tried it.
And in the first version of the paper,
we even had things like question
transpose times W times a sentence, for
instance, so that way we have an
even more powerful similarity
function that's multiplicative but
can weigh different elements
of the vectors and so on.
It turned out, after we'd done some
ablation studies, again something
you should all do for your projects too,
that if we removed that,
we had the same accuracy.
And whenever you can remove
something from your model
you should, so we just took that out.
Now once we have this feature vector here,
which is essentially a vector that
has all the similarities between
these different sentences,
we just plug that into a standard
two-layer neural network.
Standard tanh hidden units; we're very
familiar at this point with these
equations.
We have a linear layer here, and then we
basically put a softmax on top of that,
so all the attention scores sum to 1.
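Putting the last two steps together, here is a sketch of the gate computation: build the similarity feature vector z from element-wise products and absolute differences, score each sentence with a two-layer tanh network, and softmax over all sentences. The weight names and shapes are my assumptions, not the paper's exact code:

```python
# Illustrative sketch of the attention gate computation for the DMN.
import numpy as np

def z_features(s, q, m):
    """Similarity features between sentence s, question q, and memory m."""
    return np.concatenate([s * q, s * m, np.abs(s - q), np.abs(s - m)])

def attention_scores(sentences, q, m, W1, b1, w2, b2):
    """Score each sentence with a two-layer tanh net, then softmax over sentences."""
    scores = np.array([w2 @ np.tanh(W1 @ z_features(s, q, m) + b1) + b2
                       for s in sentences])
    e = np.exp(scores - scores.max())   # softmax, so the gates sum to 1
    return e / e.sum()
```

On the first pass one would set m equal to q, matching the initialization described above.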
In some ways,
this could actually be a limiting factor.
Cuz that means that I can
only pay attention to so
many things very, very strongly each
time I go over the data set and
we might not always want to do that.
Maybe we want to pay 100%
attention to five different facts.
Turns out to work reasonably well for
some data sets but
sometimes you might also instead of having
a single softmax have just sigmoids,
so you can pay a lot more attention
to a lot of different things.
And then at the very end here,
these two lines turn out to also be a GRU,
but one that won't have
very many time steps.
It's basically a GRU that goes from each
memory state to the next memory
state and agglomerates the facts that
have been collected over time here.
It turns out that is actually
not an important component;
we eventually replaced that
GRU with a standard rectified
linear unit type of two-layer neural
network, and that worked well too, but
the first iteration of the model had a GRU
between these two states as well.
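The simpler ReLU-based memory update mentioned above can be sketched in one line; the weight names and the exact concatenation order here are my assumptions, not the paper's code:

```python
# Sketch of the simpler memory update: a ReLU layer over the previous
# memory, the new episode summary e, and the question q, instead of a GRU.
import numpy as np

def memory_update(m_prev, e, q, W, b):
    """m^t = ReLU(W [m^{t-1} ; e^t ; q] + b)."""
    return np.maximum(0.0, W @ np.concatenate([m_prev, e, q]) + b)
```

Each pass over the input would produce one episode summary e and one such memory update.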
Any questions about
the episodic memory module and
how we agglomerate facts,
how we compute the attention?
Cool.
Could you raise your hand if you
feel like all these modules and
how we put them together
make sense to you?
To the people who haven't raised their
hand, can you formulate any kind of
question around, why it doesn't
make sense or what confuses you?
Yeah.
That's exactly right.
So the question is on a very high-level,
we're going over the input multiple times.
Because every time we go over the input,
we can learn to pay attention
to different kinds of sentences.
That's right.
So basically, intuitively here,
I'll just rephrase your question
in the answer, which is basically:
when I go over the input
sentences here, s_i, for
the first time, maybe s_i here captures
facts about John, but
my question here is about the football.
So in the very first iteration,
m^{t-1} (m_0) is just initialized to
the question too, so it's essentially not
adding anything either, cuz we haven't
gone over the input an entire time yet.
So, we really have these two factors.
The sentence which basically
in your hidden state,
we hope capture some fact about
John having move to the hallway or
somewhere, but the question is
just asking where the football is.
So the similarity between this vector and
this vector is not going to be very
high and then we plug this long feature
vector into this two-layer neural network,
but no matter what, the two sentences
just don't seem very related.
And so, this two-layer neural
network will not learn or
be able to identify that the sentence
seems relevant for the question,
and so this gate g here will just be very
small. But then in the second iteration
we can basically find the sentence
that connects John and the hallway.
So now our hidden state m, the
last hidden state after going
through this GRU here, which we define
as m, has captured from the very
first iteration that "John put
down the football" seems relevant.
So now m has, in its hidden state, some
facts about both John and the football.
And now, those similarities can be
picked up by this attention mechanism.
And basically now, in the next iteration
as we move over the sentence again
give a higher attention score to
that sentence that mentions John.
That's right, a sentence is just a GRU or
some averaging.
That's a great question,
and I will get to this
in a second, don't worry.
There are a bunch of different kinds
of things that people have tried and
checked if this works, and
one of them is actually basic coreference.
So being required to
answer a question where the input says "he",
and then asking who "he" refers to,
and finding the right person.
And so, the model can actually do this
very accurately as long as it sees that.
And so, the way it would do that is
just basically noticing how, inside
here, you mention John, or you mention he,
and then it learns to just go back and
find the previous
sentence that
mentions any kind of name, for instance.
And it could learn more complex
patterns than that too.
But, again, it would have to
have these multiple passes to be
able to go back to
something that
didn't make sense the first time you
went from left to right reading it, yeah.
But you would hope that the sentence
vectors capture what is in the sentence.
That's right, so
sorry to rephrase what you said,
I guess it wasn't quite a question, but
what's important here is, as the line
goes through, the last hidden state
of s_2, of that second
sentence, is the input to
the h_0 of the next sentence,
so it's one continuous GRU.
And you would hope that
as it mentions John it keeps track
of that in one of its hidden states.
Something happened about John,
and then as it reads in "he" and
updates our hidden states we would hope
that it captures something about the two
being connected now.
That's right, so the question is has there
been a study of using this exact model
on coreference resolution and
the answer is yes.
We actually played around with it.
And there are some data sets where
we actually did really well on, but
we also had to modify the model slightly.
There's some various tips and
tricks that will be a little bit
outside the scope of this lecture.
So, there's nothing in theory of why this
couldn't work at all for coreference.
The main problem is that there are a lot
of different kinds of patterns, and
you need a lot of training
data to show the model
what kinds of patterns you might
wanna capture for coreference.
And then, the main problem in coreference
is that you might want to have an answer
for every pair of words,
as in, could these two words,
quote, "refer to each other or not".
And [INAUDIBLE] so
there are some issues, but
there is no reason why this
model in general couldn't do coref.
The main tricky bit, and why you need
extra modifications, is to squeeze coref
into a question answering problem,
which is not very intuitive.
The question would then be, so
for instance,
let's say you have this whole sentence and
there are a couple of he's and
she's in there, and
now you ask what "he" refers to.
And you wouldn't know which "he" is
even meant from the question.
So then, you have to say,
what does he refer to?
And then we'd have to give
sort of an indicator function for
which "he" we're actually
caring about in the input.
So, those are the kinds of changes
you may have to make to the model to
do coreference.
So, that one seems very easy.
So the question is, what if the input
was John moves to the bedroom, and
the question is where did John move to?
And yeah, in this case, it just needs to
pay attention to that one sentence and
can immediately agglomerate that.
And then, you just wanna make sure it
doesn't ignore it in the next couple of
iterations if you have a fixed number
of passes that you have over the input.
And then,
you can just output that bedroom.
But maybe your answer is,
it's never seen the word bedroom before.
I see, yeah.
So, if they're completely
new words that describe new
existing kinds of relationships,
it would also have some trouble.
In this case, it probably would still
work because it doesn't really care about
that many things in between, unless now
you have certain things like John slept
in the bedroom versus John went to
the bedroom and now you have different
kinds of questions and it needs to know
what the actual action and verb is.
And then, if you don't have that in
the training data, then it couldn't do it.
But in general,
this kind of model struggles with
the things that I mentioned,
namely that this thing right here,
in the first version of this
model, is still a softmax.
So, if you've never seen a certain answer
at training time, the word hallway or
bedroom, maybe they've only went
to kitchens and living rooms or
something that, it would never be
able to give you that kind of answer.
But there are now ways to extend these kinds
of models with the pointer sentinel idea,
and generally pointers that learn to
point to certain parts of your input, and
that's one way of fixing that problem, and
you'll implement pointers I think for
your PA4 as well.
But a lot of the other kinds of
ideas are not that unreasonable
for those kinds of models, too.
So, the question is how do
we compute the m vector?
So the m vector is going to be
defined as the last hidden state
of this time sequence model,
which inside has a GRU,
but also has this global
gate on top of it.
So, for simplicity for instance,
if the fact is very relevant based on
this attention score g, then h will
just be computed as a standard GRU.
All right, so
this is just a single scalar.
If that single scalar is one,
then we'll just have a GRU.
Now the last hidden state of this GRU
as it goes over the inputs we'll just
define as m.
So that's for one pass.
But then, the second pass will actually
take that last hidden state and give it
as input together with the previous hidden
state and actually, never mind, yeah.
Let's just assume that
that's your m state.
There are lots of modifications
that you can make to this model, but
they're not that relevant, and
they usually only change your accuracy
by 1 to 3%. So
the simplest iteration is just going to
be going through these m's independently.
And then, you can try various options of
incorporating this last hidden state
m also at the very last time stamp.
But you can kind of- All right, awesome.
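As a rough numpy sketch of that gated update (parameter names, shapes, and gate conventions here are made up for illustration, not the exact model): when the scalar gate g is one, the state updates as a standard GRU step, and when it is zero, the previous state is copied through unchanged; the last hidden state of a pass is the memory vector m.

```python
import numpy as np

def gru_step(x, h_prev, params):
    # Standard GRU cell: update gate z, reset gate r, candidate h_tilde.
    Wz, Uz, Wr, Ur, Wh, Uh = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h_prev)
    r = sigmoid(Wr @ x + Ur @ h_prev)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))
    return z * h_tilde + (1 - z) * h_prev

def episode(facts, gates, params, hidden_size):
    # Gated pass over the fact vectors: the scalar attention gate g
    # interpolates between a normal GRU update (g = 1) and simply
    # carrying the previous hidden state forward (g = 0).
    h = np.zeros(hidden_size)
    for f, g in zip(facts, gates):
        h = g * gru_step(f, h, params) + (1 - g) * h
    return h  # last hidden state of this pass = memory vector m
```

Note that with all gates at zero the state never moves off its initialization, which is exactly the "ignore irrelevant facts" behavior described above.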
So, the answer module is also just a GRU,
but there's one little trick here,
which is we don't just have the previous
hidden state and some word
vector, cuz there are no word vectors,
there are no inputs for the answer module.
And so, what we do instead is we'll
concatenate the question vector
at every single input step here,
as well as the previous word's output.
So if we have this longer sequence
of things that we're generating,
then we have here the previous
word that we generated each time.
So this is similar to machine translation,
for instance, where we give it
as input each time the word we just
generated in the time step before.
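A hedged sketch of that decoding loop (not the exact model; a plain tanh cell stands in for the GRU, and all names and shapes are hypothetical). The key point is just that the question vector is concatenated with the previously generated word at every step:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def answer_module(m, q, W, U, W_out, n_steps):
    # Decoder sketch: the state starts at the final memory vector m; at
    # each step the input is [previous predicted word (one-hot); question
    # vector q]. A tanh cell stands in for the GRU to keep this short.
    vocab = W_out.shape[0]
    a = m.copy()
    y = np.zeros(vocab)                 # no previous word at step 0
    words = []
    for _ in range(n_steps):
        x = np.concatenate([y, q])      # question is re-fed every step
        a = np.tanh(W @ x + U @ a)      # recurrent update
        p = softmax(W_out @ a)          # distribution over answer words
        y = np.eye(vocab)[p.argmax()]   # feed the chosen word back in
        words.append(int(p.argmax()))
    return words
```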
Cool, now there's a bunch of related work.
Many of these papers we've talked about,
well actually not that many, some of them.
So Sequence to Sequence learning,
is one such model.
We didn't really cover Neural Turing
Machines in some of these other models
here, with somewhat over-promising titles,
like teaching machines to read and
comprehend.
I don't think any of these models
are really comprehending that much,
or reading in the sense that people read,
but
it's similar kinds of models
that have memory components.
As they go over different kinds of text.
The one that's most relevant and
was actually developed in parallel
to this dynamic memory network, is
the memory network from Jason Weston, and
the extension to them in
end to end memory networks.
And so it makes sense to look a little
bit at the differences between these.
Basically, both of these kinds of
models have mechanisms to compute
some representations for input,
then scoring attention and responses.
The main difference is that for
the memory networks,
the input representations
are bag of words.
And then have some nonlinear or
linear embeddings that explicitly
encode the position of each word.
And then the memory networks use
a variety of different kinds of functions,
both for attention and responses.
So essentially,
each of these four components is
a very different kind of network.
And it's not just a sequence model,
and so the dynamic memory
network here really takes as its core,
a neural sequence model.
It could be a GRU, which we
compared to LSTMs; it
turned out LSTMs got the same performance,
but have more parameters.
So we use just a GRU, and
what's nice about this is that,
that will naturally capture that we
have a temporality in our sequence.
So if we asked did this happen before,
or if we want to do sequence tagging,
things like that we can immediately
do that with this model.
So we have basically much broader range of
different kind of tasks that we can solve
with this model.
Now, before we get to evaluation,
we'll have a research highlights.
>> Cool, hi everyone, so first of all
happy belated International Women's Day
and so for the research highlight,
I'm going to present the paper
Learning Program Embeddings
to Propagate Feedback on
Student Code by Chris Piech, et al.
Great.
So,
if you remember your first days of
programming and for some of you,
if you've taken 106A here,
you'll probably remember Karel.
You'll know how important it is to
get feedback from your teachers,
about what you're coding, to
become a better programmer.
But now let's imagine that you're
teaching a class that has,
let's say,
an online course with a million students.
How do you actually make sure that
you give feedback to everyone?
Wouldn't it be nice if you just gave
feedback or if you graded, let's say,
about 100 assignments and
you could propagate that feedback to
the other students in that course?
So that's kind of the motivation for
creating program embeddings.
Yeah, the idea is that you want to be
able to cluster these programs
together by their similarity in some way.
So, and you all know that we can
do that with sentences.
We've seen that a lot in this class.
And the question is, can we also represent
programs with vector representations,
such that these vectors capture something
about the functionality of the code,
even if the code, let's say,
crashes or doesn't really compile.
So you know how to encode sentences.
You've seen lots of different
architectures to do that.
Usually, we train them on some task
which requires modeling the language.
And then we can use that neural network
to create embeddings for new inputs.
So can we do the same thing for
computer programs?
And here this is what's presented
in the paper, and this is for
a very simple program you can
see the program in the middle.
It's just like to put a beeper and
move, and
then on the left side you can see that
there are two precondition states.
So let me just define what
precondition and postcondition mean.
A precondition is a current
state of Karel's world.
For example, Karel is in the first square
and there is no beeper in the world,
so that could be a state,
that's a precondition, and then,
once you execute a program
we get to a post-condition.
It's like where Karel ends up
after we execute that program.
So, as you can imagine, for
one program we can actually have
a lot of different pre-conditions and
then once you execute the program,
you have different post-conditions.
So here you see two samples
of P_1 as one precondition,
once we execute the program,
we get to Q_1.
We also have another example,
P_k here, which brings us to Q_k.
And you can imagine we have lots of these
pre-condition post-condition pairings.
And so yeah, so the first step in this
model is that we want to encode the state.
So we encode the precondition state.
We get a vector representation for
that state, then we
apply the matrix M_A which is actually
the embedding that we're trying to learn.
So we apply that and
then we get an output from that,
which we are then decoding into what
we predict to be the post condition.
So it's in some ways similar
to the approach that we
saw at the beginning of
the class with word2vec where we
are trying to learn this feature
matrix like M_A in this case.
So the goal is that M_A
captures something about
the meaning of this particular program.
And you can also see that,
like the encoder and decoder,
they follow very similar structures
that you've seen in class.
We apply a linear function with
a nonlinearity around it, and also for
decoding, it's very similar so we use
the output from our cell in the middle and
then we, again, apply a linear function
with a nonlinearity afterwards.
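A minimal numpy sketch of that mapping (all names and dimensions are made up here, with tanh standing in as the nonlinearity): the precondition state is encoded, multiplied by the program matrix M_A, and decoded into a predicted postcondition.

```python
import numpy as np

def encode(P, W_enc, b_enc):
    # Encode a precondition state vector: linear map plus a nonlinearity.
    return np.tanh(W_enc @ P + b_enc)

def predict_post(P, M_A, W_enc, b_enc, W_dec, b_dec):
    # M_A is the program embedding being learned: it maps the encoded
    # precondition to a hidden vector, which decodes (again, a linear
    # map plus a nonlinearity) into the predicted postcondition.
    h = M_A @ encode(P, W_enc, b_enc)
    return np.tanh(W_dec @ h + b_dec)
```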
So the loss function here
mainly consists of two parts.
We have a prediction loss, which measures
how well we can predict the post condition
given our precondition P and the program.
And we also have an autoencoding loss,
which measures how good our encoder and
decoders are.
So in a way, if we have
our encoders and decoders,
we should be able to also
reconstruct our precondition.
And the last term is just regularization,
which you also have seen
many times in this class.
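Putting those terms together, here's a hedged sketch of that loss (generic `enc`/`dec` callables stand in for the actual encoder and decoder, and `lam` is a hypothetical regularization weight):

```python
import numpy as np

def program_loss(P, Q, M_A, enc, dec, lam, params):
    # Prediction loss: decode(M_A @ encode(precondition)) should match
    # the postcondition Q. Autoencoding loss: decode(encode(P)) should
    # reconstruct P. Plus an L2 regularizer on the parameters.
    pred = dec(M_A @ enc(P))
    recon = dec(enc(P))
    prediction_loss = np.sum((pred - Q) ** 2)
    autoencode_loss = np.sum((recon - P) ** 2)
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return prediction_loss + autoencode_loss + reg
```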
[COUGH] So but if we have like a complex
program, we would like to actually combine
or we can't necessarily train the previous
model on all different kinds of programs.
So we want to be able to use
these building blocks, and
then create our representation for a more
complex program.
And here, so you've already seen
recursive neural nets in this class, so
we can use recursive
neural nets to do that.
And the cool thing about programs is,
they're already in a tree structure, so
we don't even have to create a tree
structure but it's already given to you.
So given that tree structure, we can
reconstruct a recursive neural network
which exactly follows the same structure.
And then we combine these embeddings
that we've learned from the first task
together recursively
until we reach the root.
And the idea is that the embedding,
the activation at the root,
in this case is the blue one
at the top.
That representation should contain
something about the meaning,
the logic of the entire program.
And we can also train this
recursive neural network because
when we're combining these embeddings
together, we also multiply them with
parameters which we can learn
by training on an objective.
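A toy version of that bottom-up composition (binary trees over hypothetical Karel primitives; `W` and `b` stand in for the learned composition parameters): child embeddings are combined recursively, following the program's own tree, until the root activation gives the program embedding.

```python
import numpy as np

def compose(tree, leaf_vecs, W, b):
    # Recursively combine child embeddings bottom-up along the program's
    # parse tree; the activation at the root is the program embedding.
    if isinstance(tree, str):
        # Leaf: a basic block with an embedding learned in the first task.
        return leaf_vecs[tree]
    left = compose(tree[0], leaf_vecs, W, b)
    right = compose(tree[1], leaf_vecs, W, b)
    return np.tanh(W @ np.concatenate([left, right]) + b)
```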
Okay.
So
just to summarize what
this paper presented, is
a neural network architecture which allows
us to encode programs as a mapping from
precondition spaces to a postcondition
space, using recursive neural networks.
And the advantage of using recursive
neural networks in this case,
is that we are also using
the structure of the program itself.
And once we have these learned
representations, we can use them for
lots of different tasks.
So going back to our initial motivation,
we can use them to cluster students
by their similarity of programs.
So let's say once you've graded around 100
students, we can use that feedback and
give it to other students as well
who have similar programs.
We can essentially do the feedback prediction
part, and another application is to
actually perform knowledge tracing
over multiple code submissions.
So an example of that is when you're
solving your programming challenge,
you might actually submit multiple times.
It'll be interesting to
actually see your trajectories.
So as a teacher, you can oftentimes
see how the student's progressing and
you can see if the student actually
understood what they were programming or
if they just kind of
randomly guessed the code.
So this is actually still ongoing research,
something that I've actually
been working on with Chris.
So for example, possible interventions
that you might want to predict
in an online course.
When should you give a hint to the
student, should you show a motivational
video right now or should you maybe
choose a different next exercise?
And in order to do that, one thing that
you could do is given all these program
submissions by the student, you convert
them into the program embeddings,
using the approach that we just saw.
And then you can feed them into
another recurrent neural network and
then predict future student success.
Yeah, so and that's it,
if you have any questions,
feel free to find me after class and
I would love to chat about it, thanks.
>> Thanks Lisa.
That was great.
>> [APPLAUSE]
>> All right, that was awesome.
So now,
let's look at the various tasks that we
can apply the dynamic memory network to.
So the first one here is the bAbI
data set, it's actually not
a great data set in some ways because
it is actually a synthetic data set.
Which is in some ways a big no-no for
a lot of NLP people.
Because that was what the field
had done many, many years ago.
We've moved past that since, and
actually can use real language.
But I still think it was
an interesting data set for a while.
We've solved it to such a high degree at
this point already that it's not that
interesting anymore.
We've sort of solved it as
a community fairly quickly.
But it's interesting in that it gives you
a lot of necessary, but not sufficient
conditions for systems to be able to
answer certain kinds of questions.
So simple things like counting, like X
goes into the room, Y goes into the room,
how many people in the room?
Right?
So it's very simple and
if you give it enough examples,
it will be able to predict some number
between 1 to 10 or something like that or
a simple negation like John
did not go into the bathroom.
Is John in the bathroom?
No.
Things like that.
So if it sees a lot of certain
kinds of patterns, it can do this.
Sometimes indefinite knowledge.
Just simple sentences like John may have
gone to the hallway or the bathroom.
Is he in the bathroom?
Maybe.
So, super simple stuff like that.
Basic induction, positional reasoning.
And this one is actually a little harder.
You basically have a bunch
of inputs saying the castle
is north of the beach and
the beach is east of the desert.
Now John moves west and
north from the beach.
Is he south of the desert or
something like that, probably doesn't
make sense but you get the idea.
So those kinds of reasoning, were a little
hard but it turns out there's another
version of this data set where you can
sample 10,000 examples instead of 1,000.
And then most of the models can
also solve positional reasoning,
it's just a matter of how
many examples you've seen.
So in many ways, this was encouraging.
Briefly, and
then we've moved onto real data sets.
Data sets that use real language.
The agent motivation task here
is fairly simple too,
we basically showed a lot of examples of
somebody eating or drinking something.
And then you asked,
why did they drink something, and
it's because they were thirsty.
And so
thirsty is sort of the simple answer.
So again, very simple kinds of patterns.
But again, interesting and
necessary conditions.
If you can't even solve that with
your deep neural network model,
you will never get to a real question
answering system either that can solve
all kinds of more complex
types of reasoning.
What's more interesting is when we
actually applied it to real sentiment.
So here, the question is
essentially always the same,
you could almost ignore the question
vector and just have a zero,
it adds essentially just some
bias weights to the model.
But what was cool is
that model actually got
the state of the art on sentiment analysis
on the Stanford Sentiment Treebank.
And it's the same architecture.
But again, sadly,
it's not the same exact model.
So the hyperparameters here are different.
And one such hyperparameter is for
instance, how many times do you need to go
over the input before you wanna
predict the final answer?
And sadly, you get different state of
the art results depending on the tasks.
So for sentiment, for fine-grained
sentiment, so very negative,
negative, neutral, positive, very positive
classifications for each sentence,
the best is when you allow
the model to go over the input twice.
But, for various types of reasoning,
such as reasoning that requires
three different kinds of facts.
So, John went in the hallway,
the hallway is in the house,
the house is in this area.
Is John in this area, yes or no, right?
You now need to know and
go over multiple facts.
Or the simple examples,
like John dropped the football there,
where is John, so those kinds of
transitive reasonings you can make.
You can create artificially sort
of transitive reasoning chains,
that will require multiple passes, and
this also shows here that in theory,
you'd only need three passes.
But in practice, the model hadn't been
able to perfectly pick up all the relevant
facts at the exact right pass
over the input the first time and
needed multiple times.
Needed to go over it multiple times.
And this is assuming you don't
give it the fact supervision.
There's actually another data
set version of the data set,
that tells you this fact is important
the first time you go over the input.
And then this fact is important the second
time you go over the input, if you do
that, then you can get away with three
passes for these three-fact reasonings.
But without that supervision,
just training everything end to end.
With question, answer,
input triplets, it needed 5 passes
to get very high accuracy here.
So why is this not
a task-dependent hyperparameter?
It is, it is actually
a task-dependent hyperparameter.
And we did the analysis here, based on
the development splits, and
found that the best hyperparameter for
the number of passes for sentiment is 2.
That's exactly right.
So, the question is whether, in practice,
you would just, at training time,
adjust these hyperparameters for your
training data set and identify, based on
your development set, what the best
hyperparameter is, and that's exactly right.
So at least you don't have to,
for a variety of different tasks,
think about all the different kinds
of architectures that are out there.
But you still have to run some
hyperparameter search on what the best
hyperparameters are for
this architecture.
Yeah?
So why is there no result here for
5 passes?
Because compute time is costly and
it already went down after 3.
And the probability, I guess our estimate
of it magically going back up, was
very, very low, so
we just didn't run the experiment.
All right, so now,
let's look at a couple of examples
of this attention mechanism for
sentiment analysis.
So we now, here, have a couple of
examples that even the dynamic memory
network got wrong when we only allowed
it one iteration over the inputs.
And what you see here is
basically a coloring scheme.
And the darker the color is,
the larger the attention weight that
g scalar is for that particular word.
And so, these are the kinds of examples,
also, that you now need to get correct
if you want to push the state of
the art in sentiment analysis.
And they're kind of interesting and
fun, actually.
So, in its ragged, cheap and
unassuming way, the movie works.
You can see it in the first pass over the
input, it just kind of pays attention to
all the things you would sensibly
pay attention to for a sentiment
analysis trained neural network,
which is a bunch of adjectives, right?
Ragged, cheap, unassuming,
a little bit of way, and so on.
Now, on 2 passes,
the model actually is not quite
certain where to pay attention to.
In the very first pass,
a little bit of cheap, unassuming, way,
somewhat oddly the, and works.
But in the second one, it sort of now
takes into consideration the context of
that whole sentence, and really increases
the attention to the movie working and
being sort of unassuming, which is less
negative than, for instance, just cheap.
And now correctly
classifies it as positive.
Yeah?
>> Why can't it do that in one pass?
Like see all the words once and
process that?
Why does it need multiple passes?
>> So the question is,
why does the model need multiple passes?
Why couldn't it just do it in one pass?
I guess the trouble is that, basically,
as you go from left to right,
and this is, in some ways,
what we think is the reason, and
the intuitions that we
used to build this model.
But I can't definitively tell you that
that is exactly why it cannot work.
It works, so, it works on 50% or so
of the cases, it gets it perfectly right.
Now, these are just examples.
It didn't get it right on 0 or 1 pass, and
basically the difference here is 0.6.
So in a small subset of the sentences,
it could only get it
right on multiple passes.
So intuitively here, what's happening is
you actually agglomerate all the facts.
And now, with that global viewpoint
of the sentence, you can now go back.
And having this m vector,
you can now use the m vector to pay
attention to every single word out there,
and you can realize, with that, that maybe
some words are more important than others.
So, that's intuitively what you can do.
If you only go from left to right once,
then you cannot really incorporate
the global information at every time step.
You can only take the information
you've got from the left, or,
if you have a bi-directional one,
from the left and the right, but not sort
of the global picture of the sentence.
All right, here's another fun example.
The best way to hope for
any chance of enjoying this film
is by lowering your expectations.
In the beginning here,
it basically pays a little bit of
attention to a lot of different things.
But everything pales in
comparison to the second pass,
where it really focuses
on the expectations,
a little bit that they're lowered, and
realizes this is actually negative now.
And when I say realizes,
I'm anthropomorphizing a little bit here.
That classifies them correctly.
Yep?
The color scheme is the same inside each
plot, but not across different plots.
That's a great question.
So, does the attention converge,
or does it shift again?
We've noticed it's converging in a lot
of cases, but we also didn't run it for
ten passes or something.
It might eventually explode or
do something crazy, I don't know.
But in most cases it does well, but
then we also notice that it
actually deteriorates sometimes.
So there are some cases, clearly,
where it then also deviates again
from what it should have done.
I'm sort of over-, again,
anthropomorphizing my model here
a little bit, but overthinking it.
Are the weights all summing up to one?
For this model, they do not, no.
It's just sigmoids at every time step.
Great question.
So do we share weights of how we
compose information at each pass?
So, yeah, so we have these different GRUs,
one for each pass as we go over the inputs,
and it's actually a hyperparameter
that we evaluated.
And it works slightly better across
several tasks to have separate weights for
each time you go over the input.
My hunch is that it's a balance based on
how much training data you have.
If you have enough training data,
it's better to have more parameters there.
And if you don't have
enough training data,
you might wanna share the weights across passes.
Great question.
All right, second-to-last example.
The film starts out as competent but
unremarkable, and
gradually grows into something
of considerable power.
Again, it focuses first on competent,
and a little bit on everything,
and power,
and then really hones in on the power.
And now, the last one here,
I actually like a lot, which is,
my response to the film is
best described as lukewarm.
So every normal sentiment algorithm
would overindex, just like this one,
on the first pass, on best.
Cuz if you run a simple unigram-based
type of model, you get very,
very high certainty that best correlates
very much with a positive sentence.
But in the second pass it actually
lowered it, and our hope here is that it
realized best is actually used here as
an adverb, just describing something.
But what it's actually
describing is lukewarm, and
then correctly classifies
it as a negative.
Now, last task is part-of-speech tagging.
The difference here is instead of
triggering the answer module only
at the end of the episodic memory module,
we actually trigger it at
every single time step.
So at every single time step you
classify part-of-speech tags.
And when you combine a couple,
so one thing that's not mentioned here
is that we actually have,
I think, two different models.
And we ensemble to get
to the state of the art.
But really, personally,
who cares about 0.06
improvement on a task like
part-of-speech tagging.
But it's good to see that it can very
accurately also predict sequence tasks.
That's sort of the main
take away message here.
Now, in the interest of time, I'll skip
the live demo, and go to another fun fact.
Or a fun aspect of this model.
Which is,
we had a new researcher join our group.
And he had a vision background.
And so, he said,
well there's this cool new VQA,
visual question answering dataset.
Can't we use this model?
Cuz I was all like, yes!
A general model.
It's so great.
Everything is question answering.
I was super excited.
And so, he basically said,
I'll replace the input module with
one that will give us a sequence
over images, image blocks.
And he was also just new,
we had implemented at this
time everything in Torch.
We've since moved away from Torch and
then now came back to it through pytorch,
but different story.
And basically replaced
the input module and
checked if we can actually run visual
question answering with this architecture.
So what is visual question answering?
Basically, same idea.
Input, question,
answer as training input, but
the input is now a picture,
an image instead of a sequence of words.
So the kinds of questions you might
have here is what kind of tree
is in the background.
And the answer should be palm.
And another simplification here.
In this particular dataset, the answers,
you can get to way above state of
the art if you only ever predict a single
word instead of a sequence of words.
Because most of the answers
are just single words.
So instead of running a fancy GRU for
always just a single time step,
you just classify with a softmax right away.
And you give the inputs that you
would have given to the GRU,
directly to a single softmax layer.
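As a minimal sketch of that single-word answer head (all shapes hypothetical), the concatenated memory and question vectors, which the GRU decoder would otherwise have received, go straight into one softmax over the answer vocabulary:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def single_word_answer(m, q, W_out):
    # Since most VQA answers are single words, skip the GRU decoder:
    # classify directly with one softmax over the answer vocabulary,
    # fed with the memory vector m and the question vector q.
    return softmax(W_out @ np.concatenate([m, q]))
```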
Now, how does this input module change?
We won't be able to fully appreciate
this figure, to be honest,
but we can get some intuition.
We know convolutional networks,
but we don't know this exact
convolutional network, which
has a lot of bells and whistles
on top of it. Essentially,
intuitively what's happening here is that
the convolutional network will give us
a feature vector representation for
every block of the image.
So in the end, every image that we
get as input we'll chop into
a grid of 14 by 14, and
we have a feature vector for
every one of these 14 x 14 grid cells.
How's that trained?
So you can actually backpropagate
everything jointly, as well.
But as I mentioned in the beginning,
in terms of multitask learning, computer
vision is further ahead in that sense than
natural language processing,
because most people start their
convolution work from a pre-trained
convolutional network that
usually is trained on ImageNet.
So is there yet a project that answers
questions based on both images and
text, and the answer is there are some
small datasets, but no dataset that
I personally find exciting enough
to have started working on it.
It's really a dataset
problem at this point,
until we have a very compelling dataset
where you collect hundreds of thousands
of questions, ideally, that you really
need to have both an image and
some text about the image to
answer the questions from.
It's tricky,
cuz a lot of times there's some overlap.
If you take the news, for
instance, sometimes news images.
I thought about this for awhile, and
thought about collecting
that kinda dataset.
But a lot of times news images just show
sort of some general thing that
is barely related to the content.
And so, it's non-trivial to find a good
data set where that's really necessary.
Paintings?
So the question is, will you be able to do
this kind of model with paintings, too?
And I don't see why not.
I mean unless they're super abstract and
you have to really interpret a lot.
But as long as they're realistic paintings
where you actually see objects I think
that as long as they're in the training
data it should do reasonably well.
So let's assume we have a feature vector
for every single region of the image.
What we're going to do is just lay them
out in a snake-like fashion, and
now we have a sequence.
And now this sequence is given to,
again, a bidirectional GRU, and
then the final feature vector at every
time step is just a concatenation of
the forward and the backward, or
the left-to-right and
right-to-left, hidden states of
these bidirectional GRUs.
So essentially we replaced word vectors,
with feature vectors
from regions of images.
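A small sketch of that snake-like traversal and the bidirectional pass (tanh cells stand in for the GRUs, and all shapes are illustrative): odd rows are reversed so neighboring image patches stay neighbors in the sequence, and each position's final feature is the concatenation of the forward and backward hidden states.

```python
import numpy as np

def snake_order(features):
    # features: (rows, cols, d) grid of CNN patch vectors. Walk the grid
    # boustrophedon-style: left-to-right on even rows, right-to-left on odd.
    seq = []
    for r in range(features.shape[0]):
        row = features[r] if r % 2 == 0 else features[r, ::-1]
        seq.extend(row)
    return np.stack(seq)  # (rows * cols, d)

def bidir_features(seq, Wf, Uf, Wb, Ub):
    # Forward and backward recurrent passes over the patch sequence;
    # concatenate the two hidden states at each position.
    H = Uf.shape[0]
    hf, hb = np.zeros(H), np.zeros(H)
    fwd, bwd = [], []
    for x in seq:
        hf = np.tanh(Wf @ x + Uf @ hf)
        fwd.append(hf)
    for x in seq[::-1]:
        hb = np.tanh(Wb @ x + Ub @ hb)
        bwd.append(hb)
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd[::-1])]
```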
Now, what was amazing is that literally,
on the second experiment we ran,
we got to state of
the art on this dataset.
And we didn't really change the episodic
memory module code, the question code.
Just the input module changed, and we got
state of the art on this data set, and
what's even more fun is looking now
at these attention visualizations.
Again, it's unsupervised,
the model is never given supervision on
where it should pay attention,
to answer a certain kind of question,
so here we basically visualize this, so
this is kind of the equivalent plot to
what I showed you about the sentiment.
But the difference here is,
instead of being very dark blue,
it's white.
So the whiter it is,
the higher the attention score here is for
that region of the image.
And so, when we ask what is
the main color on the bus,
it actually literally pays attention to
that bus, and then gives the answer blue.
And so, we can ask what type of
trees are in the background.
Pays attention to the background and
trees.
Answer's pine.
How many pink flags are there?
Pays attention to the pink flags and
gives the answer, two.
Now, in general, number questions
are actually not that great.
So you can see here that, while this is
sort of the, close to the best model.
None of the models actually do really
great on numbering because the attention
mechanisms that all these different kinds
of models are using are very continuous.
All we're doing is as we're
agglomerating facts into this GRU,
we have high attention score and so
we're agglomerating that region.
But these models don't
have a discrete sense,
like this is a discrete object and
this is another discrete object.
And so, also if you have 50
objects it would never give you
the answer 50 because it can't count
in such a fine grained way and
of course, we have the problem that,
it's a classifier in the end and
so if it's never seen 39
objects at training time,
it cannot produce the number or
the answer 39 as a class.
So there's actually still problems with
this but in this particular data set
most of the number of objects
actually are relatively small.
Is this in the wild?
It's kind of interesting actually.
It pays attention to the man-made
structure, the house or barn or
whatever it is in the background here,
and the answer is no.
And so, I was still pretty skeptical in
the beginning of why it's doing so well.
But through these
attention visualizations,
I actually felt much
better about this model.
It's really clearly learning
something about this domain.
So again, sometimes,
you can overinterpret too much.
So who is on both photos?
It's not like the model actually gained
an understanding that there are two
photos, and then captured whether
the face is the same person in both.
The majority answer that you would
give to a who question on this is
just a baby girl, so it just says girl.
This one again is very fun.
What is the boy holding?
Actually, literally learns to
pay attention to the arm and
then the surfboard and
gives answer surfboard.
So at this point I felt like,
this is more than just sort of learning
facts from just the language itself.
It really takes into
consideration image and
this one here's actually
another fun example of that.
Cuz there are some baselines where people
compare on just looking at the image and
answering, ignoring the actual question.
It's like here's an image,
what's the answer?
[LAUGH] And that also does the right
thing 28% of the time.
So there are just certain patterns, when
you ask a question about an image,
of what people are gonna ask, and
it's capturing that sort of baseline.
In some ways, even if you just look
at the question, what is the boy holding,
in almost half of the cases it
also gets it right without
having to look at the image at all.
>> [LAUGH]
>> So
sometimes we glance over these tables, but
it's important in both your projects, and
that's what we'll do when we grade them,
that you really critically
question what's going on in those
tables.
So really, when this data set first came
out, models that combined the two,
question and images, did just 4% or
less better than just the question alone.
But then this model does around
8% better than that model,
and around 12% or so better
than just looking at the question alone.
And this is a good example of
where it actually took the image
into consideration namely, the question
here is what color are the bananas?
And if you just look at the question and
the default answer that you'd probably
be pretty good estimate would be yellow.
But in this particular image the bananas
that it's paying attention to are not
quite ripe yet and
it learns to give the answer green.
Last one here is like what's the pattern
on the cat's fur on its tail?
Actually pays attention to the tail and
says stripe.
So some of them are pretty incredible.
And so, we then put together
a demo to play around with it, and
these were eight questions
that we asked this demo.
And I was actually kind of
surprised how good it was.
So what is the girl holding?
A tennis racket.
What is the girl doing?
Playing tennis.
These are kind of simple.
And of course, it has to have seen
these kinds of answers before.
It has to have seen
pictures from that domain.
Is the girl wearing a hat?
That's actually,
a journalist had asked this.
And I was already coming up with excuses,
because the hat's sort of black and
it's a black background.
But then it got it right.
And then,
what is the girl wearing, shorts.
And what's interesting also, when you ask
what it's wearing, it says shorts, but
when you ask what color's her skirt,
it actually says white.
So there's kind of an interesting
robustness, in some ways, to the questions.
And then, I ask what color's the ground
and said brown and then I was like,
well the brown's the majority color and
so I asked what color is the ball.
And actually got it right despite the ball
being a very small part of the image.
And so, eventually the way I found
I sort of broke this demo
was I asked,
is the woman about to hit the ball?
And it said, yes and then I asked,
did the woman just hit the ball?
And it said, yes again,
and I was like, all right,
that was the last one, but again it
boils down to having seen, enough times,
certain angles I guess of the arm, and
then that kind of question, and so on.
So I don't think it's something in theory,
it could never pick up, but
it just didn't have enough training data.
So in summary, I hope I could show you and
motivate you and excite you for
your PA-4 and the various question
answering projects that you're working on.
Cuz in the end, question answering is
really one of the most interesting tasks,
I think, in natural language processing.
And a lot of the other tasks
that we looked at in the class
you could incorporate. I encourage you
actually to think about your projects in
PA-4 and extensions, and like could you
incorporate other kinds of tasks into
your data set and then see what happens
when you try to train them jointly.
The dynamic memory network can quite
accurately solve a whole variety of
different QA tasks, but as we'll talk
about in next week's lecture,
there are actually also extensions to this
where you can do dynamic generation of
answers and pay co-attention to
the inputs and the question jointly.
So still more work to be done on it.
Thank you.
