[MUSIC]
Stanford University.
>> Okay, hi everyone.
So back with lecture 14, so
in today's lecture what I'm gonna do
is introduce really the last
sort of major architecture for
neural networks that we're
gonna teach you in this class.
And so that's going to extend
beyond what we've seen with
both recurrent neural networks and
convolutional neural networks.
And today,
what I'm gonna start to look at is having
tree structured recursive neural networks.
So this is a topic that is very
dear to both me and Richard.
It's sort of dear to me as a linguist
cuz I sort of believe that
languages have this basic tree structure,
as I'll explain in the coming minutes.
And it's dear to Richard because
it's what his thesis was about.
We'll talk [LAUGH] about
some of that work later on.
So, there's some new stuff here, there's
some stuff that in some sense is the same.
So, we kind of adopted the name
because of neural networks to refer to
tree structured neural networks.
But if you think about it, recursive and
recurrent kind of come
from the same Latin root, and
they're exactly the same
through the first five letters.
And so, in some sense, it's kind of a form
of recurrent network that is now done
over a tree topology, as I'll explain.
And so, more recently,
people have commonly referred to them as
tree RNNs, sort of emphasizing that more,
what's different is the geometry
of what you're dealing with.
And in the course of that I'll talk
also a bit about constituency parsing.
So, I'll start off with
some of the motivations and
looking at how these kind of
models can be used for parsing.
We'll have the research highlight and then
I'll go on with some other stuff with some
more applications and looking at some
of the sort of better architectures
that you can then build for
tree recursive neural networks.
Before getting under way,
just quickly some reminders.
There's some staffing
complications coming up.
For Richard,
he's gonna be away on Tuesday.
So, his office hours are gonna
be after class on Thursday.
So, come along then to talk to Richard.
Conversely for me, I also have a couple
of irregularities in my schedule.
So for tomorrow, for the sort of
morning SCPD slot,
there's a linguistics faculty meeting.
So, I'm gonna have to move that
earlier, to nine to ten.
The afternoon will be the same as
usual for our project office hours.
But then next week,
I'm going to be away at the end
of the week, Thursday and Friday.
So, I'm gonna have to
sort of reschedule that.
And I'm gonna move the afternoon
slot until the following Monday.
So, look on the calendar and
get straight when those times are.
But we do still really want people
coming and chatting to us and
the various other final project
TAs about their projects.
Assignment four.
Make sure you're getting something
working using a GPU for our milestone.
We're seeing a lot more people
using GPUs and that is great.
Any problems, stop by
one of the TAs' office hours or
coding sessions, or message us,
and get that sorted out.
And then, just in general for
final project discussions, please come and
talk to us about your projects.
And we sort of regard it as just really,
really important for the final project
that you actually have something that
you can run by the end of this week.
We hope everyone doing their final
projects is in the position where
they have some data ready to go and they
can run some kind of baseline this week.
Cuz if you aren't in that position,
it's pretty marginal as to whether
things could possibly come to a good conclusion
given the time available.
Any questions about that,
talk to your project mentor.
Okay, let me go on.
So I just thought I'd start with some
general remarks about sort of what
kind of structures we put on language
in some of these
NLP and deep learning applications.
This is just a fun picture for
the sake of it.
But over at CMU they actually have this
lovely artwork which is the bag of words.
And you have the bag with words inside it.
And down onto the floor here
have fallen the function words;
they're outside the bag of words.
So, one model of language
that ends up being used
quite a bit in deep learning, and is sometimes
very effective as we've discussed,
is that you have nothing more
than a bag of words.
And for some kinds of semantic
similarity objectives,
the sad truth is that our state of the art
isn't really beyond a bag of words.
So people have used
clever bag of words models,
things like deep averaging networks, and
get very good results.
So, there were papers
at last year's ICLR.
ICLR is this new conference
where a lot of deep learning
work appears, the International Conference
on Learning Representations.
And so, there was a paper last year from
people at the Toyota Technological Institute
where essentially their
result was that for
doing paraphrase detection, whether things
have roughly the same meaning, they could
get better results with a sort of
deep averaging network than
people had been able to get with any kind
of more complex structured neural model.
I think that's a sad
result not a good result,
but that's the current state of the art.
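Just to make that concrete, here's a minimal sketch of what a deep averaging network does (my own illustrative numpy code, not the paper's; the layer sizes and names are assumptions): you average the word vectors of a sentence and push that average through a couple of feedforward layers.

```python
import numpy as np

def deep_averaging_network(word_vectors, W1, b1, W2, b2):
    """Illustrative sketch of a deep averaging network (DAN).
    word_vectors: list of d-dimensional numpy vectors for one sentence.
    The sentence is represented by the average of its word vectors,
    which is then passed through two feedforward layers."""
    avg = np.mean(word_vectors, axis=0)   # bag-of-words average
    h = np.tanh(W1 @ avg + b1)            # first hidden layer
    return np.tanh(W2 @ h + b2)           # sentence representation
```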
On the other hand, in linguistics and
in some computational models,
people attempt to use very
elaborated structures
of language,
which capture a huge amount of detail.
You can be at either of those extremes, or
you can hope to be
somewhere in the middle.
And if you wanna be somewhere in
the middle and have something that
roughly captures one of the most
fundamental properties of language,
it seems like you don't
wanna have just word vectors, and
you want to have something
that can minimally capture
the main ideas of the semantic
interpretation of language.
And the idea I want to get at there,
is it seems the fundamental notion for
human languages is that we have this
idea of composition where we can put
together the meaning of phrases.
And so, we'd like to know about larger
units that are similar in meaning.
So, if we have two pieces of text,
perhaps paraphrase for example,
the snowboarder is leaping over a mogul or
a person on a snowboard
jumps into the air.
It sort of seems the first essential
point that a human being would notice and
think is part of the solution here
is that person on a snowboard
means roughly the same as
snowboarder that it's a paraphrase.
And we can do that because we can take the
sequence of words, person on a snowboard
and say that's a phrase and
we can put together a meaning for
that phrase even if we
haven't heard it before and
know that it's got this meaning that's
roughly the same as snowboarder.
And so essentially,
that's what human beings do every day when
they're listening to their friends, right?
Every day, people are saying
to you different sentences.
They might first off just say,
how's it going?
And you've heard that 10,000 times before.
But as soon as they get a bit
further into describing their day,
they're gonna be saying sentences that
you've never heard before, and yet
we can work out what they say.
And we're not working out what they
say just as a bag of words of saying,
mm, this is about a party.
We're actually getting the details of
what they said and who said what to who.
And we're doing that by starting off
by knowing the meanings of words and
then doing a semantic composition where we
can put smaller units into bigger units,
and work out their meaning.
And so we'd like to have
models that can do the same.
So at the very top level,
this is compositionality:
how you can put pieces
together into bigger parts, and
then understand what
those bigger parts do.
I think it's a problem that's especially
prominent when you start thinking about
human languages.
But I'd like to suggest
this is not only a problem
that has to do with human languages.
In many ways it's sort of a more
general cognitive capability.
So that if you have something
like a complex picture,
well that's also something that
has a compositional structure.
So you have pieces of stuff, you
have parts of a church that go together.
You have people that go together into
a crowd or in front of this church.
And then that also has a kind
of a compositional structure.
So in general, it seems that for
language understanding in particular, but
also for other kinds of artificial
intelligence, that we need to have models
that have this kind of capability for
semantic compositionality.
You can put together smaller
pieces into larger pieces and
work out the meaning of
those larger pieces.
And what I'd like to suggest is that our
tree recursive neural networks are
one model that you could think of as a
good model for understanding compositionality.
So there's this general idea
of sort of taking parts and
putting them together into bigger
parts and understanding their meaning.
There's a notion that's related to that,
which goes slightly beyond that
which is the notion of recursion.
So by far the most famous
linguist is Noam Chomsky.
And so in some of Noam Chomsky's
recent work with colleagues
they've tried to advance
this picture further.
I mean, for
50 years it's been Chomsky's position that
humans have this special innate, born-with,
part-of-your-brain-structure ability for
human language that sort of sets
us apart from other beings.
And it's sort of actually seeing
that as having sort of specific
brain structure for language.
Not everyone else believes that.
But they've been trying to sort of
put forward these proposals as to
what is special in the sort of
capabilities of the humans for language.
And Chomsky and colleagues have wanted to
claim that really the defining property is
that humans have this ability, which you
see through language, to have recursion.
So that's just like in your CS class,
right, when you have things going back and
back to the same thing, looping over,
then you have recursive structure.
And human languages have this kind
of recursive structure, it appears.
So what's the idea of that?
So if you have the man from
the company that you spoke with about
the project yesterday, what we have
here is sort of recursive
levels of the same kind of structure.
So the project is a little noun phrase.
The company that you spoke
with about the project yesterday,
that's a bigger noun phrase that contains
the smaller noun phrase, the project.
And then the man from the company that
you spoke with about the project yesterday,
that's an even bigger noun phrase that
contains my other two noun phrases.
So human languages have
this nesting of structure,
where you sort of have the same units,
like noun phrases and
clauses, that will nest inside each other.
And you can nest them deeper and
deeper, and that gives us the idea
that recursion is a natural thing for
describing human languages.
And that's essentially the sort
of basis of Chomsky's claim.
Now, are human languages recursive?
I mean, cognitively, that's actually
a little bit debatable, and
there are active debates in
psycholinguistics about this kind of issue.
I mean, it's sort of complex because,
really,
as soon as you think that sentences
are only gonna be some finite length,
no one's gonna say a sentence
longer than 300 words,
you can never sort of prove that
things are fully recursive,
because there's sort of a maximum depth
to which things are gonna be embedded.
And actually, there's slightly
more to it than that.
It's a well known psycholinguistic
observation that actually
having things embedded in
the middle of the sentence is harder.
So I sort of deliberately put the project
not right at the very right edge,
but had something come after it.
So when you're sort of having
more central embedding,
that tends to be disfavored
in human languages.
So although you get a lot of embedding,
most of the embedding that you get tends
to have more of a chaining structure,
where you have a right
branching kind of structure.
And to the extent the structure
is purely chaining,
then you don't actually need
to have full recursion to
describe it because you can think of it
more as a kind of an iterative sequence.
Anyway, those are some
cognitive science arguments.
But nevertheless, if you're sort
of wanting to give a neat sort
of natural description
of natural languages,
basically you end
up describing them recursively.
Cuz what you want to say is well,
there are noun phrases which can expand to
various things like a short noun phrase
like the man followed by a prepositional
phrase which has a preposition
from followed by a noun phrase.
And here's this big noun phrase.
And inside this noun phrase
it's got a relative clause.
And inside that it's got other noun
phrases like you and the project.
So you kind of get these levels inside
each other heading down recursively.
And you can, and indeed people do,
in things like newswire sentences, embed
them even more deeply than my example.
And so we think about these kinds of
tree structures which have embedding
inside them, where we can have noun
phrases with noun phrases inside them.
And that's a natural way to think
about language structure and
to think about things like how
we disambiguate sentences.
We sort of talked before about
having ambiguities of attachment and
we talked about dependencies before.
The other way to see this and
think about them is to have these kind of
constituency or
phrase structure representations.
Which in computer science terms correspond
to context-free grammar representations.
And then we have noun phrase units, and
we can be saying with a spoon
is modifying the verb eats.
It's a child of the verb phrase
constituent or the with meat
here is a similar prepositional phrase
with the noun phrase inside it.
But it's modifying the noun phrase
spaghetti to build a bigger noun phrase.
And so as soon as we started using these
kind of structures in a context-free
grammar kind of structure,
we have the ability for recursion, and
human languages seem to indicate that.
And we kind of want to refer to
these units when we do other tasks.
So next Tuesday, I'm gonna talk
about co-reference resolution,
which is how you refer back in the text or
to the environment refer back to entities
that have already been established.
And that sort of can be thought of as
sort of picking out pieces of structure
in terms of this kind of constituency,
compositionality.
So John and Jane went to a big festival.
They enjoyed the trip and
the music there.
So they refers back to something, and
it seems to refer back to this noun
phrase, John and Jane.
So the trip's a noun phrase, which seems
to refer back to going to a big festival.
And the music there, the there is
again referring to this big festival.
Okay, and finally,
having these kind of grammatical
analyses of sentences is
clearly better for some tasks.
It's capturing a very powerful prior of
what human language structure is like,
and is then useful for understanding and
interpreting human languages.
So in the start of the course,
we just had word vectors, and so
we had word vectors for things like
Germany, France, Monday and Tuesday.
And we sort of were able to capture word
semantic similarity in terms of our word
vectors.
What we'd like to be able to do is say,
well,
as well as just single words,
we have larger constituents.
So we have noun phrases like the country
of my birth or the place where I was born.
And we'd also like to understand the
semantic similarity between those phrases.
And the idea that we're going to
develop to answer how can we do
that is that what we're gonna do is say,
well, what we'd like
to be able to do is take bigger
units of linguistic structure,
and also work out how to calculate
their meaning as vectors,
and place them in exactly
the same vector space.
So we're sort of hoping we
can take bigger phrases and
say let's just stick those into
our vector space as well, and
also represent their semantic similarity
as a kind of vector similarity.
Where, of course,
this example's only two dimensional, and
in practice, we'll be using 100,
200, 1000 dimensions.
Yes, question?
>> [INAUDIBLE]
>> Okay,
so the question is why would we
want to put them in the same
space as the word vectors?
I mean, that's not a necessary thing.
Obviously, you could say, no,
that's just not what I wanna do.
I'm gonna have word vectors in one place
and phrase vectors in another place.
The reason why a lot of the time
that seems a good idea is that,
I mean, individual words can
capture a lot of meaning.
And, in particular, they can sort of
bundle up a bunch of semantics that's
often equivalent to things that you can
say in other ways with a phrase, right.
So I guess I had the example right
at the beginning where I had
a person on a snowboard, and snowboarder.
It seems like, well,
those should be counted as paraphrases and
mean the same thing.
And I'm only gonna be easily able to
capture that kind of similarity if I'm
using the same vector space to represent
both phrase meaning and word meaning.
Okay, so that's our goal.
And so the question is how
can we go about doing that?
And for words what we did was
had a big lexicon of words, and
said let's learn a meaning
representation for each one of them.
And we were able to do that.
I mean, that's clearly not possible for
phrases, like the place where I was born,
because we just have an infinite number
of such phrases as they get longer.
And so we can't possibly calculate and
store a vector for each phrase.
And as we started to see last week, even
for words, in a lot of cases, it seems
like it might actually turn out to be
sort of unappealing to store a vector for
every word, especially when they're words
that have some morphological complexity,
like snowboarder, which is snow, board, er.
And so we started to talk about,
even for words, how we might want to
compose their meaning out of smaller
units as part of some neural network.
So, again,
that's what we're gonna want to do for
having these bigger phrases in language.
So we're gonna have something
like the country of my birth.
And what we'd like to be able
to do is semantic composition.
We'd like to be able to use
the principle of compositionality,
which is sort of a famous thing
from philosophy of language or
semantics, which is saying you can
derive the meaning of a phrase or
a sentence by starting with
the meaning of its words.
And have some kind of,
then, composition function
that you can then calculate meanings
of bigger units as you combine things.
So what we'd like to be able to do is
put together my birth as two words, and
have a meaning for that phrase,
a meaning for the phrase, country, and
keep on calculating up, and
get some meaning for the whole phrase,
which we could then represent
in the vector space.
Okay, so if we build models of this type,
we can potentially hope to do two things.
We can potentially use them both as
models that will build structure,
that will actually build
sentence structure as they go.
And they will also build semantics, that
they will build a meaning representation
for these phrases as they build up.
So the general picture of
what we're going to want to
do is we're gonna start off with a bunch
of words, and their word vectors,
which we'll look up in the lexicon.
The cat sat on the mat.
And then we're going to start
building phrase structure.
So we're gonna say, that's a noun phrase.
This is a noun phrase, prepositional
phrase, verb phrase, build a sentence.
So we'll have a kind of a syntactic
phrase structure of the tree.
And then we'll, using that, kind of
build up the semantic representations.
And for that case,
I just sort of knew what the right phrase
structure I wanted for the sentence was.
And just sort of drew in those nodes and
calculated their semantics.
Well, one of the questions is how can
we calculate that as we go along?
And I'll come back to that in a minute.
Then before doing that,
I just wanted to spend
a couple of slides just sort of
going back over the connections and
differences between the model types
that we've been looking at recently.
So up at the top half, we now have
our tree recursive neural network.
And in the bottom part,
we then have our recurrent neural network.
Now, in some sense the kind of linear
sequence models that you get for
recurrent networks, are kind of sort of
like a sort of a limit case of a tree,
so that if you sort of stare
down from about this angle,
what you're looking at actually
looks like a tree, right.
If you sort of tip it,
it's sort of like this tree, but
it's a tree that's always sort of
right branching down to the right.
And, actually,
it turns out that quite a lot of English
structure is right branching down
to the right, though in most sentences you
sort of get these pieces of constituency
where you get left branching, like
the country in this example.
But if you look at
the details of these two
models, the details
are kind of different.
Because in this model we're
exclusively sort of building upwards.
We're taking smaller pieces of structure
and then computing a representation for
a larger piece of structure
that they compose into.
Whereas this model is sort of
actually a kind of a funny sort of
mixed model when you think about
it, comparing it to a tree,
cuz the word vectors
are going up to compute something, but
then simultaneously,
we have something going down the tree.
And that's sort of the idea that
Richard was mentioning last time.
How the recurrent models
are really sort of
capturing representations
of whole prefixes and
you're not getting any representations
of smaller units than that.
There are a couple of other pluses and
minuses to think about.
So the problem of
tree recursive neural nets is that
you have to get a tree structure.
And so this is actually a huge problem.
I mean, I should admit
right at the beginning that tree
recursive neural networks
have not swept the world.
There's some really good linguistic
reasons to like them and
we'll say some stuff about them.
But if you just sort of go out
on arXiv and start looking at
what people are using in neural networks
for language you have to look for
a while to find people using
tree structured models, right?
There's ten times as much use of the LSTMs
that we've talked about previously.
And a big part of that
reason is because to use a
tree recursive model,
you have to have a tree structure.
And for
some of the things that we've talked about
I think you can sort of immediately
get a sense of why that's problematic.
Cuz putting a tree structure
over a sentence is making
deterministic categorical choices as
to which words are going together to
be constituents while other words aren't.
And anywhere you're making categorical
choices that's a problem for
learning a model simply by
running back propagation.
And so that sort of puts
complexity into these models.
It also means that they're kind of GPU
unfriendly, cuz there isn't just this
sort of simple lock step computation,
like an LSTM gives you.
So LSTMs have this very simple structure,
cuz it doesn't matter
what the sentence is.
The structure is always the same, right?
You just have that same sequence model
that chugs along from left to right,
which makes them very
computationally appealing.
But of course, they have the disadvantage
that they're not actually representing
any of the structure of the sentence.
And if you get back to my original
picture with the CMU bag of words, I think
there's just sort of a manifest sense for
human languages that if you want to
get this sort of first cut of roughly
the structure of human languages right,
what you're gonna have to have is to sort
of know which subunits of words
go together to behave as constituents,
the semantic parts out of which
bigger sentences are described.
Okay, so conversely we can also
think about the relationship between
the tree recursive neural networks and
convolutional neural networks.
So the central difference there is
that the tree recursive neural networks
calculate representations,
compositional vectors, only for
phrases that sort of make sense,
that are grammatical,
which a linguist would say is part
of the structure of the sentence.
Whereas what a convolutional
neural network does is say,
okay, let's just work out representations
for every pair of words,
every triple of words, every four words.
Regardless of whether they
make any sense or not.
But again there's actually
an advantage to that, right?
That since you're not actually
having to make any choices, and
you just do it for every pair of words,
and every triple of words.
You don't need a parser, you have, again,
you're back to this sort of uniform
computation without any choices.
But it's not very linguistically or
cognitively plausible, I feel.
To some extent, I actually think recurrent
models are more cognitively plausible
as an alternative to tree structure
models and convolutional neural networks.
So the sort of picture is that for
the CNN, you're sort of making
a representation of every pair of words,
every triple of words, every four words.
Whereas the tree recursive neural
network is saying, well, some of those
representations don't correspond to a
phrase, and so we're gonna delete them out.
So that for
the convolutional neural network,
you have a representation for
every bigram.
So you have a representation for the bigram
there speak and the trigram there speak slowly.
Whereas for the recursive neural network,
you only have representations for
the sort of semantically meaningful
phrases, like people there and
speak slowly, going together to give
a representation for the whole sentence.
Okay, so how do we go about calculating
things in this kind of tree recursive neural network?
So the idea is, when we wanna build
a representation of a larger unit,
what we're gonna do is take
the representation of its children, and
we're gonna have sort of binary trees for
what we're showing here.
We're gonna stick them through
some kind of neural network,
and we're going to have
probably two things come out.
One is going to be
a vector that's representing
what is going to be the meaning of
this larger unit if you construct it.
But we'd also like to parse at the same
time and work out good structures.
We also wanna have some kind of score
as to say, is this a good constituent?
Because that will allow us to
build a parser at the same time.
So how might we do that?
Let's start doing it in
the simplest way possible, just using
the kind of rudimentary neural
networks that we've looked at.
This seems like the kind of
idea we could build: so
we could take the vector
representations of the two children,
which might be just words or might
already be phrases that you've built up.
We could concatenate them to make them
into a bigger vector, have a linear layer,
multiply by a matrix, add a
non-linearity, put that through a tanh,
and so we've just got the simplest
kind of single neural net layer.
And that will then give us our
representation of the parent.
And then we want to score that, and well,
one way we could score that is just
having a vector here: we could then
multiply this representation by a vector
and that'll give us a score for the phrase.
And doing precisely that was the first
type of tree recursive neural network
that Richard explored in his
work back in about 2011.
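As a concrete sketch of that single layer (my own illustrative numpy rendering, not Richard's original code; the variable names are assumptions), the composition and scoring of one candidate parent might look like this:

```python
import numpy as np

def compose_and_score(c1, c2, W, b, u):
    """One step of a simple tree recursive neural network (sketch).
    c1, c2: d-dimensional vectors for the two children (words or phrases).
    W: (d x 2d) composition matrix, b: bias vector, u: d-dimensional
    scoring vector.  Returns the parent vector and a scalar score that
    says how good this constituent is."""
    children = np.concatenate([c1, c2])   # stack the two children
    parent = np.tanh(W @ children + b)    # single neural net layer
    score = u @ parent                    # plausibility of the phrase
    return parent, score
```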
Okay, but if we have just this,
then we're in a position
where we can use it
to parse a sentence with the tree
recursive neural network.
And so the easiest way to do that, and
it's kind of similar in a way to what
we did with the dependency
parsers sort of three weeks ago,
is to say that what we're gonna do is
run a greedy parser: we're going to look
at what seems best and make hard decisions
on each action and then proceed along.
So what we could do is we could start off
with the sentence the cat sat on the mat,
and what we're gonna do in
a sense is kind of like
what the first part of
a convolutional neural network does.
We're gonna take each pair of words and
we're going to calculate
a representation for that pair of words.
But we've got one other
tool at our disposal now.
That we're also calculating a score for
these combinations and so,
we can say that the thing that scores
best is putting together a phrase
is these two words on the left.
So, why don't we just hard
commit to those and say, okay,
we've got the cat as the constituent.
With this semantic representation and
so then at that point we can
for what to do next we have all
of the choices we had before and
additionally we have a choice that we
can put the cat together with sat.
So, that's a new choice that we can
evaluate, and at this point we can say,
well, looks like the best thing to
do is to combine the and mat together,
cuz that's got a good score, so
we do that, and we commit to that.
Then we've got one new
thing we could try,
cuz we could have the mat
go together with on.
And we can look at that and we could say,
yeah, that's a really good thing to do.
On the mat,
that's a really good phrase to have found.
So, the neural network will
commit to that one, and
then we'll kind of keep on repeating, and
we'll decide putting sat
together would be a good idea, sat
on the mat, that's a predicate, and
then we'll combine that together
with the cat and we're done.
And so we're sort of greedily, incrementally
building up parse structures
as we go along and working out
the parse structure of the sentence.
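Here's a rough sketch of that greedy procedure, under the simplifying assumption that we just score every adjacent pair, hard-commit to the best one, and repeat until the whole sentence is one node (it reuses the hypothetical compose_and_score function sketched above):

```python
def greedy_parse(word_vectors, W, b, u):
    """Greedy bottom-up parsing with a tree RNN (illustrative sketch).
    Starts from the word vectors of a sentence and repeatedly merges the
    adjacent pair of nodes whose composition scores highest, until a
    single node (the whole sentence) remains.  Returns the root vector
    and the sum of the scores of all merge decisions."""
    nodes = list(word_vectors)
    total_score = 0.0
    while len(nodes) > 1:
        # Score every adjacent pair of current nodes.
        candidates = [compose_and_score(nodes[i], nodes[i + 1], W, b, u)
                      for i in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, score = candidates[best]
        nodes[best:best + 2] = [parent]   # hard-commit to this constituent
        total_score += score
    return nodes[0], total_score
```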
And so, at the end of the day, that's
a parse tree, and we're gonna have a score
for our parse tree.
And for that score for the parse tree,
we're just gonna say, we've got a score
for each individual
decision as we put two nodes together,
and the tree has got a bunch of nodes in
it, and we're just going to sum those
scores of each node decision, and
that will give us the score of a tree.
And what we'd like to do is
find the very best tree for
this bunch of words, and
we've kind of approximated this by
doing this greedy algorithm where we
just committed to what looked like
the best constituent to build at
every particular point in time.
And so, the final thing was then
sort of set up as a loss function,
with a sort of max margin objective,
where you were sort of trying to
adjust the parameters of the model so
that you're maximizing the scores of
the parses that you have found.
And then, you're considering what
sort of structures you were finding,
what incorrect decisions you'd made,
versus what the gold structure for
the sentence is meant to be, the sort
of gold structure in the tree bank
that tells you what are the right answers.
And in this kind of a model,
in theory what you'd
like to do is to find the best tree for
each sentence, according to your model.
And then you're changing
the parameters of the model
so the model thinks the best tree is
the correct tree in the tree bank.
And that's then gonna be
minimizing your loss.
In practice, finding the best
tree according to your model,
can require an exponential amount
of work and we can't do that.
And so we're just substituting in
this greedy finding of a parse,
that looks sort of good according
to our model, and using that
in our loss function right here, the sort
of score of the parse that we found.
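Written out, a max-margin objective of the kind being described looks roughly like this (my gloss of the setup, not a formula from the slides): with x_i a sentence, y_i its gold tree, s(x, y) the summed node scores of tree y, and Delta a margin that counts the incorrect constituents,

J = \sum_i \Big( \max_{\hat{y} \in A(x_i)} \big[ s(x_i, \hat{y}) + \Delta(\hat{y}, y_i) \big] - s(x_i, y_i) \Big)

where A(x_i) is the set of trees the greedy (or beam) search actually considers, standing in for the exponentially large set of all possible trees.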
There are obvious generalizations of this.
Rather than sort of keeping just one
really best parse that you're finding, you
could have a beam parser, and you could
explore some different possibilities.
And you could sort of
have a ten best beam,
and sort of explore a bunch
of parses to the end.
And sort of then come up
with a better estimate
of what is the best parse
according to your model.
But the central thing to be aware of is
what you can't do.
The standard result for
parsing context free grammars is that
when you have grammars with labels,
like noun phrase, verb phrase,
and things like that,
then what you can do is dynamic programming
for context free grammar parsing,
and so you can then parse context
free grammars in O(n^3) time and
be guaranteed to have found
the optimal parse for
your sentence according to your grammar.
The problem here is, every time we're
putting together two constituents, we're
running it through a little neural network
and we're getting a vector out here.
And so for any way of putting
things together differently,
we're gonna get different vectors up here,
and so there's no way around the fact that if
you were actually wanting to guarantee
you'd found the best parse of a sentence,
you'd have to do the exponential
amount of work of exploring every
different possible way of putting things
together, which we don't want to do.
But in general you can work pretty
effectively by sort of doing fairly
greedy exploration, informed
by your model, to find good structures.
Okay, so this is our overall objective
function, and we want to be changing
the parameters of our model so
that it's wanting to choose a parse of
the sentence that's the same
as the gold parse of the sentence.
And so, the way we do that is again
by the backpropagation algorithm.
So, for the recurrent neural
networks we had back propagation through
time, where we're sort of chugging back
through the time steps of your linear
model.
You can generalize that
to tree structures and
actually that was done by
a couple of Germans in the 1990s.
Goller and Kuchler then came up with
this algorithm called back propagation
through structure.
And in principle,
it's the same as the back propagation
we've seen again and again and again.
But it's sort of just
slightly more complex
because you have to be getting things
working over a tree structure.
It ends up that there
are three differences.
First, for working out the updates to
the matrix W in your neural network,
just like in an RNN,
you're going to be summing up
the derivatives of the sort of error
signals that you get coming into W,
everywhere you see it
inside the tree structure.
Something that's slightly different is that
when you back propagate down,
essentially back propagating
down a tree structure,
we then have to split the derivatives and
send them down both branches of the tree
structure to the next level below.
And then when you're calculating
your error messages,
you'll have an error message
coming in from the node above.
You want to be adding to it,
the additional error from the node itself.
And then, you wanna be splitting it and
passing it down to the two
nodes down below you.
In these slides,
there are then some slides that
go through that in more detail:
summing derivatives at the nodes,
splitting the derivatives at each node,
adding error messages.
And actually Richard, last year,
wrote some Python code for
doing back propagation through structure.
I thought I wouldn't actually try and
explain in any more details than that in
class right now, all the details of that.
But you can look at these
slides on the website and
chug through it in more detail than that.
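For what it's worth, here's my own very rough sketch of what the recursive step of backprop through structure can look like (not Richard's code; the field names on the node are assumptions): at each internal node you take the delta coming from the parent, add the node's own error, backprop through the tanh, accumulate the gradient for the shared W, and then split the downward message between the left and right children.

```python
import numpy as np

def backprop_through_structure(node, delta_from_parent, W, grads):
    """Illustrative recursion for backprop through structure (sketch).
    Assumes each internal node has: vector (its tanh output), own_error
    (error injected at this node, e.g. from the scoring layer), children
    (the concatenated child vector used in the forward pass), and left /
    right child nodes (left is None at a leaf).  grads['W'] accumulates
    dJ/dW over all nodes, just as backprop through time sums over steps."""
    if node.left is None:
        return  # leaf: word-vector gradients omitted in this sketch
    # Total error at this node: message from above plus the node's own error.
    delta = delta_from_parent + node.own_error
    # Backprop through the tanh nonlinearity of the forward pass.
    delta = delta * (1.0 - node.vector ** 2)
    # Sum this node's contribution to the gradient of the shared matrix W.
    grads['W'] += np.outer(delta, node.children)
    # Split the downward message between the two children.
    delta_down = W.T @ delta
    d = delta_down.shape[0] // 2
    backprop_through_structure(node.left, delta_down[:d], W, grads)
    backprop_through_structure(node.right, delta_down[d:], W, grads)
```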
And I thought we could skip
straight across to Kevin,
who's gonna be doing
today's research highlight.
>> Okay.
Hi everyone.
I am Kevin and I am going to be presenting
deep reinforcement learning for
dialogue generation, which is a paper
by some people here at Stanford.
But some people elsewhere as well.
So the goal of this
paper is to train a chat
bot that can hold
a reasonable conversation.
And the authors approached this task
in the sequence-to-sequence framework,
where the input sequence
consists of a message, or
perhaps several messages
from a conversation.
And the output sequence is
a response to the message,
and that's what the chat bot will say.
So they use the exact same
encoder-decoder model you saw for
machine translation last week.
You can train this model with
the exact same training objective,
which is maximum likelihood estimation.
So you find the data set of people talking
to each other, and you train the model by
making it assign high probability
to the responses people say.
So once you've trained a model like this,
it's kind of fun to have the model
talk to itself and see what happens.
So this is a real conversation
from the model in the paper.
So the first chat bot says,
how old are you?
And then the second one says I'm 16,
and the first one says 16?
And then things kinda fall apart.
>> [LAUGH]
>> So the second chat bot says I don't
know what you're talking about, the first
chat bot says you don't know what you're
saying, and the chat bots get
stuck in an infinite loop.
>> [LAUGH]
>> So if we look over this dialogue
we can sort of point to some problems,
that might be causing this issue.
The first one is actually
the response I'm 16.
Although it's kind of a reasonable follow
up to the question, it's not very helpful.
So it'd be maybe better to say something
like, I'm 16, how old are you?
And now you're giving more guidance
to your conversation partner,
in what you should say next.
The second issue here is this, I don't
know what you're talking about response,
which actually is more or
less a reasonable reply.
It still kind of makes sense,
but it's a very generic response.
And really the main issue here is that,
we're training our model to produce
sentences that have high probability.
But that actually doesn't necessarily mean
sentences that are good and useful for
the conversation.
So with the, I don't know what you're
saying example, it is high probability cuz
really no matter what you
say to me I can respond with,
I don't know what you're saying, and
it sort of makes sense.
So trained with maximum likelihood
estimation, the model thinks great
this is a good response, and we want some
different objective to train the model.
So that got a little bit messed up,
but the criteria we could think of for
a good response are that
it is reasonable, so it makes sense.
But also that is non-repetitive so
we don't get in an infinite loop.
And that it's easy to answer, so
you say something a little bit
more helpful than just I'm 16.
And in this paper,
the authors come up with ways of
computationally scoring a response
according to these criteria.
So they end up with a single scoring
function that takes the responses input,
and returns some number indicating, did we
do a good job with this response or not.
And then they train a model to
maximize the scoring function,
using reinforcement learning.
So I'm not going to go into detail
on how reinforcement learning works.
But the main idea is that instead
of learning from an example, so
how a human responded to a message,
you learn from a reward signal.
So we start off with, as before,
encoding the message in a vector.
But now instead of passing in
a human-generated or a response
that a human said and try to increase this
probability according to the model, we're
gonna just leave the model to its own
devices and have it produce a response.
And then give it a reward signal,
which tells it, did it do a good job
with the response or not, which is that
scoring function I mentioned earlier.
So here it's negative, because I
don't know isn't a good response.
And through reinforcement learning,
the model will learn to not produce
these poor quality responses.
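As a hedged gloss of the idea (this is the generic policy-gradient recipe, not necessarily the paper's exact formulation): instead of maximizing the likelihood of human responses, you let the model sample its own response y to the input x, score it with the reward R(y), and nudge the parameters with something like

\nabla_\theta J \approx R(y) \, \nabla_\theta \log p_\theta(y \mid x)

so responses with high reward (easy to answer, non-repetitive, coherent) become more probable, and low-reward ones become less probable.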
And so now on to some results,
how well does this work?
These first results
are quantitative results,
where the authors showed dialogues
produced by the system to humans, and
had them say which of these
dialogues were better and
here a positive number means the
reinforcement learning system did better.
So you can see that humans thought
the reinforcement learned system,
was better particularly at making
messages that are easy to answer, and
also for the general quality
of a several turn dialog.
We can also have our chat bots
talk to each other again, and
here you see that it's doing a bit better.
So the first chat bot asks, how old are you?
But now instead of saying only I'm 16,
it also says, why are you asking?
So it's kind of helping
the conversation move along.
But actually after a couple turns,
they end up in the same infinite loop.
>> [LAUGH]
>> So this just kind of highlights,
although reinforcement learning
is a useful technique,
it doesn't kind of
magically fix everything.
So to conclude, reinforcement learning is
helpful when we want a model to do
something beyond just mimicking
the way humans perform a task.
And it's been applied to
many areas beyond dialog.
So if you're interested, there's a lot of
new and exciting work in that direction.
Thank you.
>> [APPLAUSE]
>> Okay.
Yeah so, we'll have a bit more
reinforcement learning in this class,
including I think next Tuesday,
it might come up.
But maybe I should just,
while we're on that topic,
advertise that the newest Stanford CS faculty
member, Emma Brunskill, started work yesterday.
And in the spring she's gonna be teaching
a class on reinforcement learning.
So if you wanna get a good dose
of reinforcement learning,
there's an opportunity there.
Okay.
So what I wanted to do now was,
sort of show you a bit more about how we
develop some of the ideas of having
this tree-recursive neural networks.
I guess I haven't really shown any
of the sort of quantitative results,
shown big results tables, for
that simple recursive neural network.
But the summary of it was that we could
do some sort of useful things with it for
learning about paraphrases and
getting syntactic structures right.
It sort of worked.
We were able to publish a paper on it and
all of those good things.
It seemed like it wasn't
really fully adequate for
doing all the things that
you wanted to do, for
understanding sentences and
semantic composition.
And there are a couple of ways in
which that was true, it appeared.
And so for some of the later work, and
only some of which I'm gonna
be able to show you today,
we were then sort of starting to explore
better ways in which we could put sort of,
more flexibility or
better neural units into this sort of same
basic model of tree-recursive structure,
to be able to do a better job.
And there are sort of a couple
of ways in which it seemed like,
the model probably wasn't
doing what you want.
The first one, the no interaction
between the input words is,
I kind of think, a common issue that
happens with quite a lot of models,
if you have just a single neural layer.
We sort of mentioned this also when
we were talking about attention,
that if you just sort of
concatenate c1 and c2,
and put them through a single matrix
multiply, and then a nonlinearity,
you can think of the weight
matrix as sort of being just
segmented into two smaller matrices,
where one matrix W1 multiplies c1,
and the W2 matrix multiplies c2,
and then you just sort of do
the element-wise nonlinearity.
Which sort of means that the two words,
their meanings, don't actually
really relate to each other.
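In symbols, the point is that

\tanh(W [c_1; c_2] + b) = \tanh(W_1 c_1 + W_2 c_2 + b), \quad \text{where } W = [W_1 \; W_2],

so each child only ever passes through its own half of the matrix, and the two children only meet additively inside the nonlinearity.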
And that sort of seems bad, and
I'll come back to that
as the sort of last
thing I touch on today.
But before we get to that one, the other
thing that seems kind of dubious here is
that for all semantic composition
we just have one composition
function, which has one matrix.
So it doesn't matter whether we're
putting together an adjective and
a noun or a verb and its direct object.
Or even if we're putting together
the rest of a sentence with a period at
the end of it.
In every case we're using exactly the same
matrix multiply and saying, just multiply
by that matrix and that'll put together
the meaning of your sentence for you.
And that seems pretty wishful
when you think about it, and
so here's an idea that seemed
kind of neat.
Could we get something more powerful
by allowing more flexibility in
the composition function for
different kinds of syntactic
constructions in their composition?
And so that led to the idea
of Syntactically-Untied Tree
Recursive Neural Networks.
Which actually proved to be
a very successful idea for
building high quality parsers
that parsed very well.
And essentially what this model did was
sort of argue that there's a reasonable
separation that can be made between syntax
and semantics in the following sense.
That there's sort of basic
syntactic structure of languages.
So you have a noun phrase which
can have a smaller noun phrase,
followed by a prepositional phrase,
like the man at the lectern.
And that the prepositional phrase will be
a preposition followed by a noun phrase.
That kind of syntactic structure
can be pretty well captured by
actually a symbolic grammar.
So we assumed in this model that we
did have a symbolic context free
grammar backbone that was adequate for
basic syntactic structure.
But the problem for
sort of traditional NLP and linguistics is,
although such a backbone
is pretty adequate for
telling you the possibilities for
building syntactic structure,
it's not very good at working out which
structures you should build, or
what is the meaning of
different kinds of units.
So the suggestion is it's
perfectly fine to use discrete
categorical structures for
the syntax of a language.
But what we want to do is make use of
our soft vector representations for
describing the meanings of languages.
And so therefore we can sort of
start with that observation and
then build the kind of flexibility
of composition that we're wanting.
By saying, well,
we sort of know something
about syntactic structures
in a categorical way.
If I walk right over here, well we can
know the categories of the children,
so maybe this is a preposition,
and that's a noun phrase.
We can use our symbolic
grammar to sort of then say,
okay these will go together
into a prepositional phrase.
And so
since we know these categories here,
we can then also use those categories
to decide a composition function.
So we can decide, okay we're composing
together a preposition and a noun phrase.
So let's use the composition function,
that's the right composition function for
putting together a preposition and
a noun phrase.
And so here now on this side,
rather than just always using the same
W matrix for any case of composition,
now we can say, let's use the W matrix for
putting together a preposition and
a noun phrase.
And that'll give us this bigger unit which
will have some category according to our
syntactic grammar.
And then we're gonna be putting
together an A and a P1 and so
we'll be able to use the right composition
matrix to put together those categories.
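A small sketch of that untying (illustrative; the category names and dictionary layout are my assumptions, not the exact implementation): keep a composition matrix per pair of child categories and look it up from what the symbolic grammar says you're combining.

```python
import numpy as np

def untied_compose(c1, c2, cat1, cat2, W_by_rule, b_by_rule):
    """Syntactically untied composition (illustrative sketch).
    cat1, cat2 are the children's symbolic categories, e.g. ('P', 'NP'),
    and W_by_rule / b_by_rule map a pair of categories to that rule's own
    composition parameters, so combining a preposition with a noun phrase
    uses a different matrix than combining an adjective with a noun."""
    W = W_by_rule[(cat1, cat2)]
    b = b_by_rule[(cat1, cat2)]
    return np.tanh(W @ np.concatenate([c1, c2]) + b)
```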
And there's some other good properties
of doing things this way from
a practical sense.
Doing simple PCFG parsing with
the categorical grammar is fast because
we can do it just using the symbolic
categories and dynamic programming.
And then we only have to be
doing the sort of deep learning
on the structures that we know
are the ones that we want to build.
So it's actually we are using syntax
to guide what we build rather
than trying out every different way
of putting pairs of words together.
So essentially we're using
the syntactic model to work out
reasonably plausible structures for
the sentence.
And then we're building the semantic
combination for those structures.
And then using the semantics to do the
sort of harder decisions as to what does
this prepositional phrase modify and
things like that.
And so this was called the
compositional vector grammar, where it's
a combination of a probabilistic
context-free grammar
plus this tree structured
recursive neural network.
And in this class we haven't really
talked about sort of the whole
history of doing parsing for
natural language processing and
the kind of grammars that people built.
But in some sense you can think of
this as a generalization of the kind
of things that people have been
involved in for the last decade for
trying to improve the quality of parsers.
So the starting point is you can just
have a context free grammar parser.
And that works very badly for natural
language because you kind of can't do
a good job of dealing with all
the syntactic ambiguities and
deciding things like
prepositional phrase attachments.
So back in 2003, Dan Klein and
I sort of said, well,
if we did some manual
feature engineering and
we kind of split categories, and
we had fine grained categories,
so that we knew that it was not
just a prepositional phrase, but
a prepositional phrase headed by of or
a prepositional phrase headed by for,
we could actually have a CFG parser do
quite a lot better, and that was true.
Then following on from that, a few years
after that, Slav Petrov said, well,
maybe we could actually learn those
subcategories automatically, and
that could help things along.
And that did work, and
simultaneously there was a whole line
of work on doing lexicalized parsers.
Which sort of said, well, a reasonable
way to represent the semantics
of a phrase like
the person at the lectern
is to say,
what is the head word of that phrase?
It's person, and
you just represent the semantics
of the person at the lectern
with the semantics of person.
And that was a useful idea
to help parsing and
making disambiguation decisions,
because to some extent that's right.
But on the other hand you're
losing a lot, because you're saying
the meaning of the person at
the lectern is just person, and
you've lost all the other
words of that, at the lectern.
And so, effectively for the CVGs,
we're trying to sort of extend
that further and say, well, no,
rather than just having a sort of finer
grained syntactic representation, or
substituting in the head word and
using it as a semantic representation,
we can actually calculate the semantics,
the meaning, for a whole phrase, and
then use that for doing our disambiguation
decisions in parsing, and
that will be able to be more accurate.
And to a first approximation,
that actually works.
So here are some results from parsing.
So, this is sort of trying to parse, for
context free grammar structures,
natural language sentences over a famous
corpus of
Wall Street Journal articles.
And what we're scoring here is
sort of an F1 measure as to
whether you're getting
particular constituents right.
So you're making constituency claims
like that there's a noun phrase
from words three to 12.
And then that's either right or wrong.
And so you can see how there's been
a succession of people gradually getting
better at this task.
So, if you just sort of have a plain CFG,
your score is about 72%.
Our more kind of manually
feature engineered
context free grammar was
considerably better, about 85%.
Some of the ideas of putting in
lexical heads were even better, 87%.
The automatically splitting,
which sort of mixes syntax and
lexical information,
was even better at about 90%.
But
if you just have a plain RNN,
it's sort of not that great, because
if you just have one W matrix,
you kind of can't model
a lot of composition.
But by having this idea of the
syntactically untied RNN, where you can
learn these different composition
functions for different kinds of phrases,
that actually worked very nicely and
produced a strong, well performing parser.
I mean, there are some
better parsing numbers where
people have done various kinds of self
training and data-oriented parsing, and actually
there are some more recent results since
2013 that I don't show on this slide.
But nevertheless, this sort of proved
a successful way to sort of build,
a sort of more semantically
sensitive parser.
In some sense, the interest of
this isn't sort of just that
you can get parse structures right for
sentences.
So the biggest interestingly
different thing here is,
well actually we are computing a semantic,
some meaning representation for
each phrase
that gets back to that original idea
of understanding meaning similarities.
And that's just something that, by itself,
a sort of context free
grammar isn't giving you at all.
And there's sort of some neat
things you can see out of that.
I mean, one of the neat things you
can see out of this is that
you can observe how
the soft grammar learns
notions of where the information is,
and so what the head words of phrases are.
So this is as it starts to put
together pairs of words: you can
see, by the activations in the matrix,
where it's getting information from.
And so there's something you
have to know to interpret this.
So for training this model,
what Richard did was he started
off the matrices with sort
of identity initialization.
But they're sort of kind of two half
matrices with identity initialization
because this is sort of the part of the
matrix that's multiplying the left child.
And this is the part of the matrix
that's multiplying the right child.
So they were initialized with the identity
initialization we've sort of spoken about
before, which sort of has a similar effect,
of allowing this sort of propagation of
information at the beginning of training,
to the kind of thing that an LSTM does.
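A sketch of that initialization, as I read the description (the exact scaling and noise level are assumptions on my part): the composition matrix starts out as two identity-like blocks side by side, so early in training a parent is roughly the average of its two children and information just flows upward.

```python
import numpy as np

def init_composition_matrix(d, noise=0.01, seed=0):
    """Initialize a (d x 2d) composition matrix as [I/2, I/2] plus a
    little noise, so at the start of training the parent vector is
    roughly the average of the left and right children (illustrative;
    exact scaling assumed)."""
    rng = np.random.default_rng(seed)
    W = 0.5 * np.hstack([np.eye(d), np.eye(d)])
    return W + noise * rng.standard_normal((d, 2 * d))
```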
And so, if you're putting together
a noun phrase with a conjunction, so
this is sort of something
like the student and.
Well, it's correctly learning
that most of the information
is coming from the student, and
there's relatively little coming from and.
If you're putting together a possessive
pronoun and the rest of a noun phrase.
So it's something like his cat.
Most of the information
is coming from cat,
with relatively little coming from his.
And here are some other examples.
If you're putting an adjective
together with a noun, so
this is something like red chair.
It's learning that you gain kind of quite
a lot of information from both sides.
This is a whole adjective phrase and
a noun, so this is something like
extremely dark movie.
So you're again, getting lots
of information from both sides.
And there's sort of some structure
here that you can sort of see
the same dimensions seem to be marking
the kind of modifier meaning for
both the adjective phrase and
just the plain adjective.
So that's kind of cool.
The more interesting thing
then is to sort of say, well,
are we actually getting sort of a
semantics for these phrases and sentences
that is capturing semantic similarity in
the way I claimed right at the beginning?
And actually that did
work reasonably well.
So here is one sort of test that
we did to try and illustrate that.
So basically we're saying okay, for any
sentence or any phrase we've calculated
a meaning of that phrase so that we can
sort of place into our vector space.
So just like for word vectors,
we can then say,
what other sentences are placed
nearest together in the space?
Cuz they should have the same meaning.
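Concretely, the test is just a nearest-neighbour lookup in the phrase vector space, along these lines (a sketch; the use of cosine similarity here is my assumption):

```python
import numpy as np

def nearest_sentences(query_vec, sentence_vecs, k=2):
    """Return the indices of the k sentences whose tree-RNN root vectors
    are closest (by cosine similarity) to the query sentence's vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    sims = [cos(query_vec, v) for v in sentence_vecs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
```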
So if the test sentence is, all the figures
are adjusted for seasonal variations,
the two closest sentences
in the Wall Street Journal corpus are:
all the numbers are adjusted for
seasonal fluctuations, and
all the figures are adjusted to
remove usual seasonal patterns.
And this is kind of actually nice, right?
That well, in some parts,
like all the figures are the same,
all the numbers are very similar.
But in other places it seems to have
learnt quite interesting things.
So, are adjusted for seasonal variations,
are adjusted to remove
usual seasonal patterns.
So that's actually quite a different piece
of word choice and syntactic construction.
It's learnt quite nicely that
they're very similar in meaning.
Knight-Ridder wouldn't
comment on the offer.
The two closest sentences
were Harsco declined to say
what country placed the order, and
Coastal wouldn't disclose the terms.
Those ones aren't quite so
excellent, you could say.
I mean, to be fair,
something that you have to be aware of is
that there are limits to how perfectly
you can find other sentences that
mean roughly the same thing.
Cuz this is only being run over a
corpus of about 40,000 sentences, so
except for the sort of fairly formulaic
utterances that get repeated quite a bit,
often you're gonna have to be choosing
sentences that are sort of somewhat different.
So you know,
there probably aren't other sentences with
Knight-Ridder not commenting on the offer.
But you know, some of them
are perhaps a little bit too different.
Declined to say what
country placed the order.
But nevertheless,
it does seem to have captured something as
to what the main semantics is going on.
So, all of these, so
this first sentence is a company
not wanting to say something
about some transaction.
And both of these two closest sentences
are also a company not wanting to say
something about some transactions.
So, there is a sort of meta-sense
in which it does seem to
capture the semantic
similarity pretty well.
Final example here is sales growing.
And again, the two sentences that are
closest to that are both other sentences
where sales are growing further.
So, that was kind of nice and
that still seemed to work pretty nicely.
We still kind of weren't really
convinced that we were doing
a great job at capturing
the semantics of phrases.
And there was still a worry
that things didn't work very well.
So now, we'd changed things so
that we had different Ws depending on
whether we're combining an adjective and
a noun, or a verb and its object,
or whatever like that, but
otherwise we still had the problem that I
mentioned before; that hadn't gone away.
When you're doing this matrix-vector
multiply, you've still got kind of half
of W multiplying c1 and half of W
multiplying c2, and there's no real
interaction between c1 and c2, and that
just doesn't actually seem like what
you want for natural language.
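To make that concrete, here's a minimal sketch of the plain composition step (the dimensions and initialization are just illustrative, not the real training code):

```python
import numpy as np

d = 50                                  # hypothetical hidden dimension
W = np.random.randn(d, 2 * d) * 0.01    # composition matrix
b = np.zeros(d)

def compose(c1, c2):
    """Plain tree-RNN composition: p = tanh(W [c1; c2] + b)."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

# Note that W @ [c1; c2] = W[:, :d] @ c1 + W[:, d:] @ c2: before the tanh,
# the left half of W only ever touches c1 and the right half only touches c2,
# so the two children never interact multiplicatively.
# (In the syntactically untied variant, W is simply looked up from a table
# keyed by the pair of syntactic categories being combined.)
```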
Oops.
Okay, sorry.
That was a bit we did before.
And so, in particular, you know, what every
semanticist has observed and worked to
account for in symbolic theories of
semantics for the last 40, 50 years is
that what you actually get in natural
language is words that act as operators or
functions that modify the meaning
of other words.
So that if you have
something like very good, or
extremely good, or quite good,
or any of those things.
It seems like what you have is
you have good that has a meaning.
And then very is some kind of operator or
function
that'll modify the meaning of good, to
make it sort of more extreme and strong.
Or weaker, depending on whether you're
saying very, extremely, quite, etc.
So, it sort of seems like somehow we'd
like to be able to build neural networks
that capture those kind of ideas
of composition for language.
So, the last thing I want to mention today
is what was then sort of version three of
how we might capture that composition.
Essentially, if you have a vector here for
good, and you want to be able to modify
its meaning with an operator like very,
how might you do that? A natural idea from
linear algebra to think about is,
well, what if I made very a matrix?
Then I can do a matrix vector multiply,
and
calculate a new vector, and
that could be a meaning of very good.
And there had been some
previous work that for
particular phrase combinations
had done precisely that.
So, there's been a paper or
two that looked at,
let's describe the meaning of
adjective noun combinations or
adverb adjective combinations by doing
that kind of matrix-vector multiply.
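As a toy illustration of that word-as-operator idea (all the values here, including the intensifier matrix, are made up just to show the shape of the computation):

```python
import numpy as np

d = 4
good = np.random.randn(d)        # hypothetical word vector for "good"
Very = 1.5 * np.eye(d)           # a made-up "intensifier" matrix for "very"
very_good = Very @ good          # matrix-vector multiply gives the phrase vector
```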
But we wanted something more general that
could be applied arbitrarily to whole
sentences to come up with a meaning, and
so we came up with a model where we were
going to have matrices and vectors and we
were going to combine them together in all
ways to try and create sort of
meanings for phrases.
If we now had very good, what we're gonna
say is, well, we're not quite sure when
something is gonna be an operator and
when it's gonna be operated on.
So let's just have it both ways and
see what comes out.
So, each word is going to be represented
by both a vector and a matrix, so very and
good are each represented by a vector and
a matrix. And so then, to compute a
representation of the phrase very good,
what we're going to do is multiply each
vector by the other word's matrix.
So, big A with little b, and we're going
to multiply big B by little a, so we're
going to do both vector-matrix multiplies,
and then having done that, we're going to
concatenate those as we've done before.
And then we're gonna put that through
a neural network layer just like before,
so we have another W matrix to decide
which of these to use and how, it goes
through a tanh, and that gives our
parent representation.
So at that point we've got a vector
representation for the parent.
And that sort of looks hopeful.
But well, we wanna keep on building this
up into a representation of whole phrases
and we wanna be able to
build very good movie.
And well, at that point,
we sort of conceptually wanna be using
the word vector for movie and would like
to say well this is another operator and
we'd like to multiply very good by movie.
So, to do that, we're gonna have to also
have a matrix coming out for very good.
And so we wanted to build one of those,
and so we said, well, in addition to that,
what we could do is put together the
two matrices A and B that we've
got down here.
We can concatenate them to build a bigger
matrix and do a matrix multiply on that,
and that would then give us a matrix
representation of the parent.
And so formally now our parent will
have both a vector and a matrix.
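Here's a minimal sketch of that matrix-vector composition at one node (the shapes and initialization are assumptions for illustration, not the released code):

```python
import numpy as np

d = 50
W   = np.random.randn(d, 2 * d) * 0.01   # combines the two vector-matrix products
W_M = np.random.randn(d, 2 * d) * 0.01   # builds the parent's matrix

def mv_compose(a, A, b, B):
    """One matrix-vector RNN node: children (a, A) and (b, B),
    each a vector of shape (d,) and a matrix of shape (d, d)."""
    cross = np.concatenate([B @ a, A @ b])   # each word operates on the other
    p = np.tanh(W @ cross)                   # parent vector, shape (d,)
    P = W_M @ np.vstack([A, B])              # parent matrix, shape (d, d)
    return p, P
```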
So, I'll show you a bit about this model.
It could do some quite interesting
things and it also had some weaknesses.
And I think- In terms of
where the weaknesses are,
if I give the game away
right at the beginning,
a lot of the weaknesses were in
this part here for the matrices.
Cuz we sort of had problems with that.
I mean, firstly,
the matrices were kinda problematic,
because matrices have a lot
of parameters in them.
And so, that makes it hard to learn them
effectively, and that's an idea that could
perhaps be revisited using some ideas that
have come up since then. But secondly,
we didn't have a very good way of
composing matrices to build new matrices,
so that part of the model perhaps wasn't
so great. But here is sort of a picture
that shows you some of the things that
you would like to be able to do and
whether models can do them.
So, we're looking now at building
two-word combinations, so, fairly annoying,
fairly awesome, fairly sad,
not annoying, not awesome, not sad,
unbelievably annoying,
unbelievably awesome, unbelievably sad.
And then what we're wanting to
do is take those phrases and
interpret their sentiment and
put a probability
distribution over sentiment
scores between 1 and 10.
So, 10 is extremely good,
and 1 is extremely bad.
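Reading that distribution off the model is just a softmax classifier sitting on top of the composed phrase vector; here's a hypothetical sketch (the names W_s and b_s are assumptions for illustration, not anything from the paper):

```python
import numpy as np

def sentiment_distribution(phrase_vec, W_s, b_s):
    """Probability distribution over 10 sentiment scores from a phrase vector.

    W_s has shape (10, d) and b_s has shape (10,).
    """
    logits = W_s @ phrase_vec + b_s
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()
```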
And for some pairs of words, we actually
had some sort of empirical data that was
kind of connected to these meanings.
The empirical data was kind of
a sort of distant supervision,
not to be trusted very much.
But this was sort of saying, well, suppose
we'd seen a review of a movie that said
not sad: what rating was being given to
that movie? And it was sometimes a bad
rating, occasionally a good rating.
And that's what's shown in that
one example for the red line.
But what's perhaps more interesting to see
is that for some of these combinations,
the plain RNN is actually more effective
than you might at first have thought it
would be. So for something like
unbelievably awesome, that can be captured
pretty well: even the basic RNN model is
good at knowing that unbelievably awesome
means it's a very good movie and
very positive sentiment.
We can even do some sort
of more interesting things.
So if you have the phrase unbelievably
sad, it turns out that that's kind of
ambiguous: there are many really good
movies that are unbelievably sad.
So there's a lot of weight over here and
then, there are some cases
where people say, a movie is unbelievably
sad, because it's just terrible.
[LAUGH] And so, you're actually
getting this sort of U shape,
where you get some weight on
both ends of the spectrum.
And, interestingly, even the plain RNN,
and also our matrix vector
RNN is able to capture that pretty well,
so that's kind of nice.
But, there are some things
that the basic RNN just
isn't able to capture right,
where our new model does a lot better.
So if you look at these middle row ones,
like not annoying and not awesome.
By itself, the word not just tends
to be a marker of negativity, right?
People who go around saying not a lot
are negative people.
So, all else being equal, if you see
the word not in something, the
sentiment is more likely to be negative.
And then a word like annoying,
that's a negative sentiment word.
And so for the basic RNN model if you put
those together they just have a kind of
additive effect it turns out, so that's
the green line where it says not annoying,
it predicts that means bad movie low
rating, whereas the result we'd like
to have is that not is an operator and
it modifies the meaning of annoying and
this means like it's not so
bad and it's not annoying.
And the interesting thing about natural
languages is that they sort of don't work
like basic logic, where basic logic might
tell you that not annoying means this is
a good movie because it's not annoying,
whereas real human beings, when they say
words like not bad or not annoying,
mean it's sort of okay.
And so,
the distribution that the matrix
vector model comes up with
is sort of flat, but it's sort of at least
basically right that it actually gives
the highest probability to the midrange of
the distribution, which is kinda correct.
And you see a similar effect with
not awesome: the basic RNN isn't
able to capture that,
it's still giving most of its weight
to saying I've seen the word awesome,
this is a good movie, whereas the
matrix-vector RNN has at least done sort
of better and has tamped down the weight
it gives to meaning that it's a great movie.
So it sort of seems like we're kinda doing
better at being able to
model meaning combinations.
Yes?
Okay, so the question is, is it really
wise to be doing it both ways with
the matrix and the vector or could you
actually use your syntactic structure and
know which way to apply these two
things and do it only one way?
Yeah, so for these results,
even though all my phrases have
the operator on the left and
the thing being modified on the right,
this is running that symmetric
matrix-vector model, and
it's just doing it both ways.
So I think you're totally right,
it wasn't something we did in this work.
But it seems like you can very
reasonably say well wait a minute.
Why can't you take both of these two
models that you've just shown us?
If you're using the sort of
syntactically untied model,
you know that if you're doing adjective
noun combination you should treat
the thing on the left as the operator.
And then you can just run
it in one direction and
I mean, in general I think that'd
be a very reasonable thing to try.
It sort of means that you have to have
complete coverage, so in every case where
two different categories come together,
you have to decide for sure which one
you're going to treat as the operator,
and that might actually require quite
a bit of work. But in principle you
could do it, and that would work.
Yeah, so we're basically done for today,
but just to show you one other thing that
we were able to do with the matrix-vector
model, which was sort of a nice example
of how we're able to use this model
to do an NLP task.
So this was a task of learning
semantic relationships as
a kind of relation extraction task.
So, this was a data set that some people
had explored where you had sentences like,
my apartment has a pretty large kitchen,
and then what you were meant to say was,
what is the relationship between
apartment and kitchen?
And there was a set of relationships that
they could have, and one of them was the
component-whole relationship, so the
correct answer here was to say that the
kitchen is a component of the apartment.
And then there were various other
relationships, like a tool that you were
using, or the material something was made
out of, and various other kinds
of relationships.
And so, we explored using this
matrix-vector recursive neural network
to learn these relationships.
And the way that we were doing that
is we were building up semantic
compositions using the matrix-vector RNN
model, and we build up these semantic
compositions until you reach the point
where the two noun phrases of interest
join together in the structure of the
sentence. So the movie showed wars:
this was sort of the message, the content
that's being shown on this medium.
So, where they join together,
we'd say, okay, we've built
up a semantic representation that
covers the movie showed wars.
So at this point,
we then use another neural network,
that's just a straight
classifier that says okay,
classify the relationship here,
it's an example of message topic.
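A sketch of what that final classification step looks like, under assumed names (the relation list here is abbreviated and the weights are placeholders):

```python
import numpy as np

RELATIONS = ["Component-Whole", "Message-Topic", "Instrument-Agency"]  # abbreviated

def classify_relation(joining_node_vec, W_cls, b_cls):
    """Softmax over relation labels given the vector at the node where the two
    noun phrases of interest join; W_cls has shape (num_relations, d)."""
    logits = W_cls @ joining_node_vec + b_cls
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return RELATIONS[int(np.argmax(probs))], probs
```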
So we built that model, and this is
a nice example where it seemed like we
were able to show that, again, having
this extra power in the model
actually paid off.
So here are just some results on that.
So people had worked on
this data set previously.
A sort of basic support vector machine
got about 60% F1.
A MaxEnt model, so it's like a logistic
regression model with a lot of features,
got 77%.
An SVM model with a huge amount of
hand-built linguistic resources and
features, so it's using everything:
WordNet, dependency parses, Levin verb
classes, PropBank, FrameNet, NomLex-Plus,
Google n-grams, paraphrases, TextRunner,
every feature and knowledge source you
could possibly think of to throw into it,
got 82.2%.
So here are our results with
our neural network model.
So the plain recursive neural network
got a little under 75%, so that's actually
pretty good when all it's doing is
learning its own semantic representations
that put things together, but it's
not quite leading the pack.
So the matrix vector model actually
is clearly doing something useful,
so it's sort of getting you
about 4% better scores.
So that shows we have made some
progress in semantic representation.
Of course, like everyone else,
we wanted to have our model better
than the last people's model.
So then we sort of built a model that
put in a few extra features, but only
fairly basic ones: WordNet,
part-of-speech, and NER.
And that was sort of, hey, I wanna push
this just over the line. But
I think the main message is that you're
sort of starting to get decent semantic
models of phrase relationships without
actually having much more than just these
continuous representations of semantics.
Okay, I'll stop there for now and
we'll get to more next week.
