Welcome to the brave souls who are here in person.
You get to see this thrilling report first-hand.
We are here to talk about, uh, Bake-off 3.
Um, in case you've already forgotten,
what we were doing was
developing the best relation extraction systems we possibly could.
It's a highly multi-class problem, and we are
looking at F1 to determine the best systems.
Here's a histogram of the results from about 100 different, um,
models, compared to where our simple bag-of-words
featurizer baseline from the homework was.
You can see this is very exciting.
Um, most of us managed to beat that baseline. So that's great.
Um, we've been doing this thing where we use our, uh,
VSM code to analyze views of the code variables to see if we could find any trends.
Um, in this case, it's really noisy.
So we're gonna skip this, and, uh,
instead, look at, um,
something that's a little bit more anecdotal.
But we went through the top 15 and the bottom 15, um,
systems and tried to draw out patterns, what were some of the commonalities.
And, um, for the top-performing systems,
I think the biggest trend is that
caring about your features
seems to be a lot more productive than caring
about your model or your modeling architecture.
So, um, for the top systems,
we can see here:
66% of the top systems used just a simple logistic regression.
The other 33% used a random forest.
And really, it was the wealth,
the voluminousness, of featurizers
that distinguished the top systems.
Um, so it's kind of fun scanning through this code
because it's like playing Mad Libs with the words left,
right, middle, directional, POS, bigram, and trigram.
You can pick three,
combine them, and do that ten times,
and you'll have a really good system.
And definitely, combining different kinds of features
brought us the biggest lift.
So on average, the top 15 systems had five featurizers,
compared to the bottom ones which generally just had one.
And I think that shows that you can
get pretty far by thinking about
what the inputs to the model are.
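To make that pattern concrete, here's a minimal sketch of the "pick several featurizers and combine them" recipe. The featurizer signatures and the Example structure are hypothetical stand-ins, not the exact homework API:

```python
from collections import Counter, namedtuple

# Hypothetical example structure: a candidate entity pair with the text between
# the two mentions and the POS tags to its left (not the homework's exact API).
Example = namedtuple("Example", ["middle", "left_pos"])

def middle_bigram_featurizer(example, feature_counter):
    # Bigrams over the words between the two entity mentions.
    words = example.middle.split()
    for bigram in zip(words, words[1:]):
        feature_counter["middle_bigram=" + "_".join(bigram)] += 1

def left_pos_featurizer(example, feature_counter):
    # POS tags appearing to the left of the first mention.
    for tag in example.left_pos:
        feature_counter["left_pos=" + tag] += 1

def featurize(example, featurizers):
    """Union of featurizers: the 'combine different kinds of features' recipe."""
    feature_counter = Counter()
    for featurizer in featurizers:
        featurizer(example, feature_counter)
    return feature_counter

# Top systems typically stacked five or more of these.
featurizers = [middle_bigram_featurizer, left_pos_featurizer]
ex = Example(middle="was born in", left_pos=["NNP", "VBD"])
print(featurize(ex, featurizers))
```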
So one example, the very top system,
congratulations to this group.
Um, so you can see here, we have,
I think, ten different featurizers.
This group did include some GloVe features as well,
and that, on repeated runs,
brought them in just above our second place,
who also had a nice long list of featurizers,
the main difference here being
that they did not include GloVe.
And these are good numbers here; I'm delighted by them.
And then, yeah, the less magnificent systems:
generally, there was more exploration of different model types.
So some people tried different neural classifiers,
or something like an SVC,
and generally just a very short list of featurizers.
And so it was fun to see in some of the top systems
how there was tons and tons of
code trying all these different models that was just ultimately
abandoned for the default model factory from the homework.
Um, all in all, I loved looking through the systems,
a lot of creativity, and I think we can all go to sleep proud that we as a group have
managed to beat our baseline so resolutely.
So thanks for a really fun bake-off and,
you know, one more to go.
Um, all right. Thank you.
Any questions?
Yeah, any questions? Yes.
The median?
I'm sorry.
The median?
Whoa, I don't have that offhand.
It looks like, it's probably about 0.65.
That was the score that almost everybody got, that was the most commonly occurring one.
That's below, oh, I see that's below 0.7.
Yeah, that's about 0.65.
So the winning teams, that's really far above that.
Yes.
That's very impressive.
Yeah, really. That's true.
Anything else? All right.
Did you get that baseline number?
Mine is a lot higher.
Your baseline's a lot higher?
Ooh, maybe I'm being way too generous here.
Um, that was, uh,
just the number when I ran it.
So I guess maybe I should have run it more.
Maybe I'm too optimistic.
Um, okay. We can check that again.
So maybe- maybe the big spike here is right around the baseline.
Okay. That's median, yeah. All right.
Good point. I'll run it again a couple of times.
We'll see where it shakes out. All right.
Time for the real star of the show [NOISE].
All right. Thank you, Moritz, that's great.
And I guess it's worth pausing and just saying,
so Bake-off 4 has been launched,
the poster's up on Piazza.
Um, and I know a lot of you have really
invested in the systems that you produce for these four bake-offs.
And for people who did that,
you should feel proud.
I mean, you've worked in a deep way on four very different problems in NLU,
which is maybe four more deep
analyses than other courses offer.
And I think that's a real accomplishment, right?
You've got real expertise now in those four different problems.
Um, and some of you have built
incredibly impressive systems for multiple of these bake-offs, which is, again,
just a very impressive accomplishment.
Um, yes, so Bake-off 4 is up, uh, due Wednesday.
That's the final one for us,
and then we'll move into a more project oriented mode.
And I guess we'll talk, um,
a lot more about that in the coming weeks.
The lit review after
the bake-off is the next thing that's due; that's due on May 13.
Um, and there are guidelines, uh,
at the website about what that should contain.
Um, we could look at them now just in case there are any questions,
actually that might be healthy, let's see.
Just so you know where they are and everything.
So it's a short paper,
um, you can do in a group.
You should pick one of the final project groups on Piazza to do that,
so that credit gets assigned in the way that you expect.
Um, groups of one should do 5 papers.
Groups of two should do 7.
Groups of three should do 9.
So those numbers there are kind of encouraging you to work in a large group.
It's by no means required, but you do overall
less work in terms of reviewing papers if you do that.
And the other thing I want to say here,
is just that, this is more than just a lit review.
So it is also attempting to push you to
begin thinking about what problem you're going to tackle.
And I will say that,
given the time allotted,
the ideal thing is that over the next week or so,
you and your group just settle on a problem,
um, and then go forth and do it.
Changing in midstream, there just isn't a lot of time to do that.
And I feel like that leaves you behind.
So you should try to figure things out now and then just go forward with that plan.
Um, and this is meant to sort of encourage you to
start that problem solving because we want you to state,
for the papers that you chose,
what their general problem is.
That's a good creative act,
a way of kind of summarizing what you've studied.
You should offer some concise summaries.
Um, but what you should really be doing there is trying to tease out maybe
the major contributions and also getting to point three here:
where these papers are alike, where they're different.
Maybe where they're starting to fall down or where they've
left gaps or where there's obvious room for improvement.
Maybe like, it's an older paper and you know about some newer models,
that would be worth trying on the task,
in which case, that's an obvious thing to point out because that might start
to point out the outlines of the projects that you ultimately want to do.
Uh, and as we say here,
this is the most valuable.
This could be the basis for the lit review section.
Because in the paper that you write for the lit review section,
the goal there is not to just rehearse what's been said,
but rather to contextualize
your work and sort of justify the modeling choices that you made.
And this compare and contrast is going to lay the groundwork for doing that, I would say.
So that's the part that you'll want to give real thought to.
And then obviously future work suggestions for how these papers could be extended.
If you do that in a thoughtful way,
then it's an obvious discussion place for you and
your teaching team mentor because your mentor can go in and say like,
actually, I think the first one is resolved,
but the second one is absolutely worth pursuing.
The third one is maybe too hard.
So why don't you think about two as kind of
a guiding thesis for your work on a project, right.
That's an incredibly productive thing that could happen if you
think really seriously about kind of next steps for this.
And then we also want you to have a proper references section.
We're gonna be nitpicky about that.
You will lose some credit if you don't have a proper bibliography,
so just follow through on that.
And again like, that's going to carry through all these assignments.
I think it's just part of scholarship.
We're not picky about the format for this thing,
but we are picky that you have a proper bibliography.
And that's the spirit of this point 5 here.
So think about this with your team,
get to work on this as a document.
Uh, and do try to use it as a chance to really
lay the groundwork for your project because then you'll be off and running.
Um, any questions about that?
Good. So after that,
the next document for
the final project is the experimental protocol, as we're calling it.
There are also, um, requirements for that.
I think I won't review them right now.
Except to say that, this is part of us kind of
pushing you in the direction of having a quantitative analysis that uses data.
That's a value that we've encoded in this whole project structure.
And the protocol is gonna be a kind of
check-in that your project will succeed in that way,
and that you're also doing the right things in terms of setting up
baselines and that you have the proper metrics, and things like that.
Uh, so once you've done that,
I think you really have like,
the foundation for your paper in terms of its experiments and everything.
And then the final things after that are more presentational.
So a video and then the final paper itself.
Good.
Okay. So I propose that we dive into the material.
Today we're gonna talk about grounded language understanding.
This is like maybe my favorite topic for NLU,
and this is kind of where we finally start to
realize what you might think of as the true dream for NLU,
which is that you'd have a robot friend or something.
Um, and the reason I emphasize grounding here is just because I do
think that if we're gonna achieve that goal of having robot friends, uh,
they're gonna have to exist in kind of social physical environments in some sense,
or at least be attuned to the fact that humans live in such environments.
I think they will not succeed if all they have done is
experience text flying through them and maybe produce text.
Right? Uh, and I'm gonna try to substantiate that claim a little bit.
But I think it's pretty intuitive that a big part of the human experience,
with language but also just in general, is being grounded in physical and social situations.
And so we leverage a lot of that information in language use, production, and understanding.
And so all of that brings me back to the fact that if our systems are going to succeed,
then they'll need to be grounded in these ways.
And so what I've tried to do is just assemble kind of
the core building blocks that I think you can
productively use to start to design systems that are grounded.
And at the end of the lecture,
I'm going to suggest some ways in which even if you don't
think of your problem as a grounded NLU problem,
there might be ways that you could bring in some of that grounding,
and thereby enrich your system in terms of its performance,
but maybe also just in terms of the rich ways that you could conceptualize your problem.
And I've framed this as a linguist would.
So I'm gonna give you some history,
and then I'm gonna talk about speakers.
Those are gonna be agents that observe some non-linguistic thing,
and produce a linguistic expression.
And then a listener is an agent that consumes language,
and makes an inference of some kind about the world.
And then they come together in chatbots,
but the thing that I want to emphasize there is that I think for a chatbot to succeed,
again they need to be doing something that's grounded in something language external.
If they're just talking back and forth at each other with no purpose or goal,
um, they're not gonna succeed at doing anything like what we want them to do.
So I'm gonna emphasize grounding there.
And then finally it gets most pragmatic when we talk about reasoning about other minds.
This would be how I could be a more effective user of language.
If I think not just about production or comprehension,
but also about what it's like when I'm a speaker,
to be a listener, and when I'm a listener,
what it's like to be a speaker.
And that can enrich the system that you're using,
it can fill in lots of the gaps in terms
of the actual sentences we exchange with each other,
and I think it keys into something deep about cognition,
but also about kind of efficiently learning from data.
That's another angle that I'm going to try to
convince you of when we talk about those models.
And then finally, just a smattering of things that I didn't really have time for.
But let's start with some of this history.
This is a slide idea that I stole from Andrew McCallum,
um, and he just encourages us to think, maybe in 2019,
now, uh, about the movie 2001,
because you could think of that movie as in part
making predictions about three things in terms of technology.
First, a prediction about what computer graphics will be like in the year 2001.
About computer chess playing in 2001,
and finally about NLU in 2001.
Because the on-board robot HAL,
which is kind of the ship itself,
conducts very natural, open-ended conversations with humans.
And we could think about what life is like for us.
So let's do those comparisons.
For graphics, these are the graphics that they show for HAL.
And in 1993, so even well before 2001,
we had things like Jurassic Park.
So this would be a case where the creators of this movie just
dramatically failed to imagine how rich the future would be, right?
It's kind of a failure of imagination that this is what they thought computer graphics would be like,
and this was the reality in 1993. Big gap.
We made a lot more progress than expected I guess.
For chess playing, you could argue that they got it exactly right.
So HAL the on-board computer plays chess like an expert, uh,
and just before that,
in terms of actual human history Deep Blue beat Garry Kasparov in 1997.
So that's pretty impressive that they more or less nailed the kind of
timeline in which computers would get good at a game like chess.
The third one, dialogue.
So if you've seen the movie, you can see that HAL has
these incredible human-like conversations with humans.
And it's not just that it has these conversations,
but that they are obviously deeply grounded
in HAL's plans and goals.
He's in fact conspiring against them for his own purposes and so forth.
But these are incredibly rich interactions, right?
By 2014, Bill showed us what this was like.
So this is the dream for Siri,
which is already not quite approaching what HAL can achieve,
but maybe at least I can get my tacos ordered or something.
Uh, but then Bill showed us this funny skit from Stephen Colbert,
which just showed the gap between what Stanley Kubrick imagined
and what we actually realized,
even more than a decade after 2001.
And that's sort of funny, because if you compare:
chess, right in the middle;
graphics, this failure to imagine;
and then when it comes to NLU,
this kind of overstating of what we would actually achieve.
And I- I think that comes back to the fact
that humans using language, it feels effortless.
It feels like something that we ought to be able to master.
And in fact, all throughout the history of AI,
you have people stating that they're just on the verge of solving all these problems,
now that they have the perceptron or,
you know, computer programming in Scheme,
or now deep learning.
These things are just about to solve all these problems,
and they always turn out to be elusive.
They always turn out to be harder than we imagined.
And I guess I'm going to try to illuminate part of why that is.
I like this quote from Terry Winograd
that kind of keys into this grounding idea that language is action:
All language use can be thought of as a way of activating procedures within the hearer.
We can think of an utterance as a program,
one that indirectly causes a set of operations to be carried
out within the hearer's cognitive system. It's kind of nice.
It keys into the goal-oriented nature of this,
and also the social component.
Also by way of conveying to you why
this problem is persistently more difficult than people think,
I thought I would present this analogy that I
use when I teach semantics and pragmatics as well.
This comes from the linguist Stephen Levinson.
And he asks us to imagine,
you know, there's a Rembrandt sketch here,
and just observe for a second that,
even if you've never seen it before,
you can kind of make out that there's a figure in the center.
There seem to be some people
kneeling at the feet of this figure in the center.
There's an archway or something,
some kind of building in the background.
And if you have additional context for Rembrandt sketches,
or for sketches like this,
you could probably even make the inference that this is a Christ figure in the middle,
with disciples around, right?
You probably have other images in mind that kind of resemble it,
that are more high fidelity.
Levinson's point is that this is incredible.
If you really look at this sketch,
you start to realize that it is
just the barest outline of all of that rich stuff that you were able to infer.
I mean, it is incredibly sketchy.
It just- at some level of detail,
that's like a random collection of lines.
But we are geared as humans to be able to take
those random squiggles and infer
all that richness from it and you do it at kind of multiple layers.
Right? So like, even if you don't have much context,
you can infer that there's a figure.
If you have additional social and cultural context,
you can figure out that it's a Christ figure,
and if you know something about Rembrandt, right,
this could get arbitrarily deep because you- when you interpret that picture,
you're drawing on much more than just the raw visual image there.
You are able to imbue it with lots of other information.
And that's a kind of miracle about
human vision and the cognitive stuff that goes along with it.
Levinson's point is that utterance interpretation is very much like this visual act.
When people speak to you,
you are able to extract much more information from
their utterances than seems to be encoded in those utterances directly.
We do an incredible amount of kind of what you might call reading between
the lines or figuring out from somebody's very sketchy utterance,
and the context you're in,
and the goals you take them to have,
and what you know about them as people.
You're able to infer much more than what's just directly encoded there.
And in fact, it might be that most of
the information that you extract is not coming directly from what's encoded,
but rather from all of that other stuff working in
conjunction with what people actually said to you.
Just like here, the lines tell
only this very small part of the full story about what you extract from the image.
And the naturalness of what you do with the image is akin to the naturalness of
what you do when you interpret people's utterances or
when you expect others to interpret your utterances.
You don't reflect on all of the gaps that people are filling
in and you're apt then to underestimate just how hard that is.
But when you confront NLU problems, you
start to become aware very quickly of just how much of this is
implicit and coming from context and
other things that aren't encoded in the language that we use.
And I like this analogy a lot.
I mean, this is sort of like vision would be really easy,
you know, for robots if everything had a barcode on it.
And language understanding would be really easy for
artificial systems if everything was
limited to what's encoded in the language that we actually utter with each other.
The language itself would be incredibly
cumbersome but at least it would be easy for machines just like those barcodes.
But the reality is that so much of it is coming from elsewhere.
And that's why I think that in the end,
for- to have a rich NLU agent,
you're going to have an agent that has lots of other human-like capabilities.
Let me highlight a few of these things, just a kind of random collection of
things that I like from pragmatics that
sort of point at how hard this problem is going to be.
So I'm starting with what linguists call indexicality.
Like, I am speaking.
The indexical here is the obvious one,
is the first person pronoun, I.
So you might think that's a pretty easy problem to solve if you have some kind of grounding.
I mean, obviously if you've just got text coming in and out,
then figuring out what I means is a kind of bizarre question even to ask.
But if you think about today's devices like your phone,
this is something that you might have
a pretty sophisticated view of already like the phone could
just at some level know that the referent of I is the current user,
or the owner of the phone,
or something like that and that will be correct a lot of the time.
Already, I think that's pretty sophisticated,
but that's just the easy case.
So what about, we won?
The meaning of we is constrained probably to include the speaker but
the actual collection of people that it refers to is incredibly context-dependent.
It could be a team I'm on,
it could be a team I support,
it could be the people in my community,
it could be the people in my country, right?
Every way that you could possibly
imagine this going could be a meaning for we.
So already, when we move from I to we,
we're confronted with essentially all the problems of pragmatics.
I am here: maybe the I is easy, but what about the here?
Could be the classroom, could be Stanford.
It's maybe not gonna be interpreted as planet Earth because
that's not such an informative thing to say in 2019.
But maybe someday, I am here will be a meaningful thing,
where here refers to planet Earth.
What about we are here pointing at a map?
So that's an incredible kind of abstraction that with a visual depiction,
somebody can point and have that transfer to the physical location,
and that it doesn't mean that you're at the point on the map.
I'm not here now.
This is a wonderful example from linguistics and philosophy of language.
But I'm not sure that people would use this anymore because there are
no landline phones and therefore, no answering machines.
But in the old days your answering machine might say I am not here now,
which is a wonderful little puzzle because how could it
be that the current speaker is not in the place that they are speaking?
We went to a local bar after work.
This has the challenges of we.
And I was just highlighting here that adjectives like local,
their interpretation is again like here.
It's heavily context dependent and highly
variable in terms of exactly what kind of thing it refers to.
And there are lots of expressions like this.
Three days ago is indexical on the time of utterance in terms
of how many days back you go; tomorrow and now work in a similar way.
And now, of course,
has all the uncertainty of here.
It's just in the temporal domain.
So those are some small words.
I think the only way that you could get a grip on them in
any sense is with some grounding, and then, of course,
you see that they have all the problems of pragmatics writ small.
Here are a few other examples of context dependence: where are you from?
This is a frustrating question for me.
I think about this all the time because when people ask me where I'm from,
I never know what sense they mean.
Do they mean Connecticut, where I was born,
or the U.S., where I'm a citizen, or Stanford,
which is my affiliation or planet Earth, again,
nobody asks that but maybe we'll live in an exciting future where they do.
This is coming from where and the reason I highlight this example is it's
a kinda mundane question but already you can see
that it keys into the goals that the speaker has.
They're trying to figure out something about you and
the sense of the question is determined by those goals.
Now, those goals might be kind of under-determined.
I feel like when people ask this,
they often don't know what kind of information they're hoping to get back and you
have this awkward little dance about whether you resolved their issue.
That's an added layer of complexity on top of the fact that,
to know what the question means,
I have to know what your goal is and that implies a sophisticated kind of grounding.
Here are some more. I didn't see any:
this is kind of keying into ellipsis, but also what linguists call quantifier domains.
So are there typos in my slides?
I didn't see any; you might have, but.
Are there bookstores downtown? I didn't see any.
Are there cookies in the cupboard? I didn't see any.
Right. There's no way I'm going to get a grip on what this utterance
means unless I know at least what the preceding discourse context is.
But then you also have to do some really sophisticated stuff about figuring
out how essentially to expand the meaning of this elliptical phrase here, any.
And that again has two layers of complexity.
First, I have to figure out what thing you were talking about but then I might
also have to do some domain restriction in the sense of like,
when I say that everyone entered the bake-off,
I don't mean everyone in the universe.
You guys effortlessly figure out that when I say,
everyone in this context,
I mean the students of CS224U,
but in a different context,
if I use it in my house or in my department,
everyone takes on a completely different sense.
The light is on. Chris must be in his office.
The Dean passed a new rule.
Chris must be in his office.
Here showing that modals have a kind of sense ambiguity.
This one is epistemic and this one is deontic.
So this one is keying into knowledge and here
you're offering your evidence that the light is on,
that's evidence that he's in his office.
This one is deontic in the sense that it keys into laws of the campus in this case.
And in this case, this is the relevant rule.
They have very different senses.
This is pervasive in language.
Modals exist not just as these auxiliaries like must and may, and
can and should, but also in adverbs and things like that.
And a big part of figuring out what the statements mean is figuring out
the sense: epistemic, deontic, circumstantial.
There's lots of different flavors.
And again, we do this without much reflection.
Classic example, if kangaroos had no tails they would fall over.
It probably seems true but suppose they had jet packs.
If I am able to insert that into the kind of
implicit counterfactual reasoning that goes into this conditional here,
then it feels false, right?
They wouldn't fall over in the worlds where they have jet packs;
they can just fly around.
Somehow, you know not to consider that possibility until I introduce it, in which case,
maybe your judgment about the truth of
the statement changes and
that highlights the flexibility of your reasoning in these counterfactual situations.
And also again, the fact that you didn't really
reflect on it when I gave you the initial example.
But it's certainly important to be able to do that to
figure out what these claims are even saying.
There's another example that I love.
These are two books, I think they're real books:
What They Teach You at Harvard Business School,
and What They Don't Teach You at Harvard Business School.
And then there's a masterful tweet:
"These two books contain the sum total of all human knowledge."
[LAUGHTER]
That person is thinking like a logician.
The actual sense of both of these titles,
I guess is that there is implicit domain restriction on these whats here and so it's not
this knowledge and its complement in the full space of all the things that can be
knowable but rather they both have taken on a much more refined sense.
Perspectival expressions.
This is another kind of grounding: left and right, front and back,
former and latter.
And this was a case where somebody's confused about perspective.
This is from the website
You Had One Job:
please, when using the stairs,
stay to the right when going up,
stay to the left when going down.
This will keep people from running into each other.
It will not. And so here just to kind of summarize this.
I have this fairly mundane statement,
many students met with me yesterday, and in a kind of informal way,
I just highlighted all of the ways in which, to understand that expression,
you need some grounding in the physical or the social context.
So, like, how big is the set of students?
For many students, that's relevant.
What's the time of utterance, for understanding these temporal expressions?
Who's the speaker, for me?
What's the additional contextual restriction on students, which
might affect the domain of these students here? What's the time?
And then you do other kinds of things that we call Gricean pragmatics, like:
if you said many students met with me,
does that mean it's false that all or most did,
for the relevant set of students, and so forth, and so on.
Uh, this is- as I said, just completely pervasive.
Any questions about that before I switch gears a little bit?
That's a quick run through my greatest hits of phenomena in semantics and pragmatics.
I could do that all day.
If you want much more of that you could take 130A with me next winter.
If you haven't already. I know some familiar faces here.
One thing I love about older work in natural language understanding,
is that it seems like they had fully
internalized this message that I just sent about the importance of grounding.
Because from the beginning,
the systems were grounded in particular physical spaces,
and a really influential paradigm case of this is Terry Winograd's SHRDLU system.
Um, there's a YouTube video that kind
of depicts what this was like, but basically it was
a blocks world in which things could be moved around.
The blocks have colors and so it's a very finite domain.
Uh, and you could issue instructions into
this world and then the blocks would be manipulated.
And the beauty of that, first of all,
is just that from the start,
these dialogues with the system were fully grounded in that world.
And then when you look at actual examples of things that the system could do,
you can see just how much, at some level,
the system was leveraging the fact that it was grounded in that space.
So like what does the box contain?
The box is an expression that you can
only make sense of if you know what environment you're in.
That is if you, you and your listener can figure
out what the speaker's intended referent for the box is.
Um, the blue pyramid is the same way.
What is the pyramid supported by?
That's kind of anaphoric to utterance 2 there.
The boxes- the same box as 1.
Um, how many blocks are not in the box?
Again, that's an impor- that may- that's
a very difficult question to make sense of unless you're grounded in an environment.
Four of them is referring back to blocks.
System could do that.
Um, the red cube, yeah,
all these things, kind of- I mean,
the only way any system I think could make sense of a dialogue like this and
perform these actions and answer these questions is if it was grounded in a world,
like the one that Terry Winograd had constructed.
And then there's another contribution that Terry Winograd made early on that I
think still influences how people think about NLU.
And these are these sentences called Winograd sentences or schema sometimes.
Um, here's one. The trophy doesn't fit into the brown suitcase because it's too small.
What is too small?
And you're able to figure out that that means
the suitcase and the reason you're able to do that is because you have
an ability to kind of simulate mentally
what these objects are like and how they're related to each other,
which is coming from the fact that you live in
a physical world and you have a lot of these experiences.
So you're able to figure out that there, it's the suitcase,
whereas the trophy doesn't fit into the brown suitcase because it's too large.
What's too large? There, you resolve to the trophy.
Again, via the same kind of reasoning about physical objects.
The council refused the demonstrators a permit because they feared violence.
Who feared violence?
You can figure out that it's the council and you can do that
because of your understanding about social roles and their relationships.
If I change it a little bit,
the council refused the demonstrators a permit because they advocated violence.
Who advocated violence?
Then your interpretation flips to the demonstrators.
That first one is very physical in its grounding and
the second one is grounded in more like social constructs.
Uh, these are very difficult problems.
I should've checked on this, but at least as of recently,
for the corpus of Winograd sentences,
even very sophisticated deep learning models were more or less performing at chance.
And people worked really hard and continue to work hard to kind
of construct systems that can solve these problems.
I suspect that the real answer will come from systems that have a whole lot of
really sophisticated and robust kind of commonsense understanding about
the world and that it won't come from just pattern matching on examples like this.
And then finally another thing we could do is just take our cue from neighboring fields.
So for example, developmental psych is a field that is
all about how humans learn language,
at least in part it's about how humans learn language.
They have lots of lessons about this and after all,
humans are the best language learners by far in the universe, as far as we know.
Uh, kids can learn word meanings for example,
with incredible speed despite relatively little input,
certainly nothing like the amount of input that you would get for
a model that- even models that you've trained in this class.
How do they do it? The evidence suggests that they use information
that's inherent in the ways that forms
contrast with each other, but also social cues,
assumptions about the speaker's goals, and regularities in the physical environment.
Only this first one is the sort of information that you would get from looking just at,
for example co-occurrence information.
These other ones depend on grounding.
So some consequences for NLU.
As I said, human children are by far
the best agents at learning language and they depend heavily on grounding.
That's an important clue for us.
Problems that are intractable without
grounding are solvable with the right kinds of grounding.
You know, the SHRDLU cases are a dramatic example of that.
I think you won't get traction on dialogues like that.
Reproducing them or understanding them or anything,
unless you know about the blocks world that they're describing.
Deep learning: here's a bright spot for this.
It's a real change over recent years.
So deep learning systems,
they're really flexible toolkits for reasoning about
lots of different kinds of information in a single model.
And this has I think,
led to some real breakthroughs when it comes to grounding because after all,
one way you could think about grounding is,
I'm going to have some vectors for
my words but maybe I'll also have vectors for the people in
my environment or the objects in my environment
or like a vector representation of this image or this video.
And these deep learning models are happy to say I'll take
your word representations and your video representations,
smush them together and do something interesting.
[NOISE] Uh, and that's a very productive way to think about how you
might gather a lot of evidence and use it to make a decision.
Which is more or less the lesson of this lecture here.
So we should seek out data sets and develop data sets that
include the right kind of grounding and to the extent that you can do that,
I think your system will get better.
Because at one level or another,
all the language that we use with each other is grounded if only indirectly.
That's my pitch, high-level description. Yeah.
Should we also kind of adapt the [inaudible] evaluations we're doing? Like,
even the Winograd schemas, which seem like
they're getting at something varied and rich about language,
they don't have a video associated with them, or, you know,
here's the still of the picture that I'm describing.
Um, do you think that would be something necessary
to kind of go along with these data sets?
I think that's in the spirit of what I'm saying, right?
Which you- so you could say,
I think that no system is going to get traction on your suitcase example with the trophy,
uh, unless I also show it lots of images of suitcases and trophies,
and maybe their physical relationships to each other.
And maybe that's indirect about exactly what we care about,
but maybe that's enough to push the system in the right direction,
and that would reinforce my narrative which is,
it wasn't there in the language but it
might be there in this other kind of representation.
Um, I'm also reminded of like,
um, it's related to things I've worked on that kind of key into your intuition.
So it would be very unusual for somebody to describe the fact
that the computer in a room is on the desk because that's the most mundane thing ever.
It's the default assumption.
If I'd describe the room to you,
I might not bother to mention it and I might just say there's
a computer and you'll infer that it's on a desk or something.
And it's only if it's on the floor that I would bother to describe that.
So given that bias that I have for describing things,
only things that are unusual,
how will your system ever learn the default stuff?
And one answer might be that it can consume
visual representations, and that fills in all those gaps,
things that we never describe in language, and showing that would be wonderful.
You somewhat answered my question with the
later part of the explanation, but I was thinking of trying to take away
sort of one level of indirection, and also maybe trying to
build a logic network. [NOISE] So if you could recognize that,
like, a suitcase is an object and objects have
size, and then try to sort of work out that simulated model.
Uh, is that another approach to trying to do the same thing?
Great example. And so there,
the grounding in my terms
might be a knowledge graph that has all this information.
Yeah, and that would get you over maybe the hurdle of default things not being described.
Yeah. Great idea.
I'm gonna move on to these speakers.
This is gonna get more low-level now.
Because partly what I wanted to do is empower you to be able to do some experiments.
And I also think scientifically it's useful to start small.
I mean, those SHRDLU dialogues are really ambitious.
We're probably gonna get more traction
if we start with things that are kind of more controlled.
So that's the spirit of what I'm gonna talk about.
You'll see that a lot of this is gonna be about color reference.
Uh, I forecasted this before.
I chose color reference for a few reasons.
So we have wonderful datasets for this.
Colors are great because they're an interesting kind of grounding.
You don't have to deal with all the complexity of visual scenes.
But colors themselves are cognitively complex and also pretty important to us.
And then the language that you get for describing colors is also quite rich,
especially if you don't constrain people to a color vocabulary.
So as a starting point, um, Randall Munroe,
the guy behind XKCD, years ago,
he did a color survey where he just asked people to describe colors.
It's a massive dataset with millions of
instances of people describing color patches.
Here are some examples.
And they go from the very simple things like green and purple up top,
grape which is a little more poetical, turquoise, rarer.
Then they get more descriptive, moss green,
light blue-gray, robin's egg blue.
Already some domain knowledge creeping in.
British racing green is not a phrase I'd ever heard before.
Baby puke green is a classic example from this dataset.
There's lots of puke colors in the dataset.
That's the cutest. So this is kind of what I mean:
we have a wonderful playground of reference,
these color patches, and then the language people
produce in response is also very colorful.
The very first model to think about here is my most basic speaker.
And we're gonna do this in these kinds of encoder-decoder
neural frameworks that I'm sure you guys have heard about.
This is a simple version of models you might have seen.
So down here, I'm gonna assume that what comes in is
like an RGB representation of the color.
So just three floats.
All right, this one happens to be HSV,
I think but some kind of color representation comes in.
You might do something to embed that color,
like some kind of transformation of it to make it higher
dimensional or maybe more cognitively interesting. So that'll happen.
And then you might map that into
a hidden representation which I'll call the color representation there in orange.
That's the encoder part.
This is a very simple encoder, which might be just
a feed-forward model that produces this hidden representation.
The decoder part for our speakers is more interesting.
So the way the decoding works is,
we just assume they generate a start symbol.
That's where they begin to speak.
That gets an embedding representation just like any other word would.
And then you produce a hidden representation which is
a function of the color representation and your word rep.
So this is like where we've chained them together with this red line.
These two have to have the same dimensionality,
because this color rep will be the initial hidden state of the decoder.
This is kind of the input.
And then the model makes a prediction,
and what it does here,
this is a kind of subtlety with what I've written as dark.
It's making a prediction that is a probability distribution over the entire vocabulary.
So this is a very high dimensional probability distribution.
Uh, that's an idea that can take some getting used to.
It's a kind of miracle that this works because it is a classifier,
but it's a classifier into
this incredibly high dimensional space but things work out all right.
That's where you get your error signal.
So the truth in the data might be that the first word is light,
uh, for this color that we're describing down here.
And that- that's a one-hot encoded vector and you get a standard kind
of error signal there. Here's a kind of summary.
So embedding derived representation coming from these guys,
predicts a distribution maybe like with a Softmax classifier,
and that gives you your error signal here.
I've called this teacher training at the top
because even though we produce the word dark,
the actual data is light.
And so in teacher training here,
we just make light the next word.
So that is the input.
We look up its embedding.
We get a hidden representation from H1 and x37;
they come in, and then we make another prediction, blue.
And in this case, the error signal we did a good job there and we continue.
So blue, look up its embedding, hidden representation.
We predicted green, but the actual thing to do was to stop,
to produce the stop symbol.
So that's another error signal for the model.
And in teacher forcing,
the model will stop even though it didn't produce the end token. That make sense?
This is a generalization of our model for NLI, right?
What we did for NLI was just look at the final output state.
That's where we get our label.
Here, it's like we get a label at every turn.
And then the other twist, which might take some getting used to, is
this teacher forcing idea: even though we made a wrong prediction here,
we insert the correct word down here.
In actual practice, when you train these models,
I guess best practice is to mix some teacher
forcing with some more exploratory work where,
even though we predicted the wrong thing here,
you actually insert dark as the next word and let it explore a wider part of the space.
So that's how training works.
And then at prediction time,
again, a color comes in.
That's what's given to us. Look up its embedding.
Get a representation, and then we have to just go on our own here, right?
So produce these predictions and stop making predictions when we hit the end symbol.
And in that case, here we produced the description dark blue.
Does that make sense?
I don't know how many of you have seen these models before.
Um, on the one hand, it looks just like the NLI model.
On the other hand, when you think through how
training has to work and how prediction has to work,
this is a bit of a twist here.
The way you use the predictions as inputs to the next step, but it is straightforward.
If you wanted, for example, to implement this, it's straightforward to extend the PyTorch
NLI model to do this kind of thing.
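Here's a minimal PyTorch sketch of that kind of speaker, assuming the color comes in as a small float vector and that training uses teacher forcing; the dimensions, the vocabulary handling, and the names are illustrative, not the course code:

```python
import torch
import torch.nn as nn

class ColorSpeaker(nn.Module):
    """Minimal encoder-decoder speaker: encode a color, decode a description."""
    def __init__(self, vocab_size, color_dim=3, embed_dim=50, hidden_dim=50):
        super().__init__()
        self.color_encoder = nn.Linear(color_dim, hidden_dim)   # color representation
        self.embedding = nn.Embedding(vocab_size, embed_dim)    # word embeddings
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)         # distribution over vocab

    def forward(self, colors, word_ids):
        # colors: (batch, color_dim); word_ids: (batch, seq_len), starting with <s>.
        h0 = torch.tanh(self.color_encoder(colors)).unsqueeze(0)  # initial hidden state
        embeds = self.embedding(word_ids)
        states, _ = self.rnn(embeds, h0)
        return self.output(states)   # (batch, seq_len, vocab_size) logits

# Teacher forcing: the inputs are the gold words shifted right; the targets are
# the gold words themselves, so each step is trained to predict the next gold word.
vocab = ["<s>", "</s>", "light", "dark", "blue", "green"]
speaker = ColorSpeaker(len(vocab))
colors = torch.tensor([[0.1, 0.2, 0.8]])      # one RGB-ish color
inputs = torch.tensor([[0, 2, 4]])            # <s> light blue
targets = torch.tensor([[2, 4, 1]])           # light blue </s>
logits = speaker(colors, inputs)
loss = nn.CrossEntropyLoss()(logits.view(-1, len(vocab)), targets.view(-1))
loss.backward()
```

At prediction time you would feed in just the start symbol, take the argmax (or sample) at each step, feed that word back in as the next input, and stop once the end symbol is produced.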
In a paper that I did with Will Monroe and Noah Goodman a few years ago,
we did exactly this model using the XKCD dataset.
We did a Fourier transform to get our initial representation here.
And then, another trick that we did is,
so that the model wouldn't forget what color it observed.
We appended the color representation to all of the inputs.
So that was kind of like at each token.
Remember that you're trying to describe this particular color and I think that helped it,
kind of, be guided in the expected way
by this little bit of grounding that we're giving it.
So [inaudible] to generate the weights and then use
the weights or was it actually build a model that is good to describe in colors?
We wanted the model that would describe colors.
Um, and then- so this is another nice side effect of working in a simple domain.
Uh, even though this is a big crazy neural language model,
we could actually get a lot of analytic insights about what it was
doing by looking at its representations in color space.
Uh, because like we can give it as many colors
as we want and see what things it produces,
and then we can look at the resulting representations and you could see like for example,
that it had learned that greenish typically excludes green in a kind of interesting way.
Um, stuff like that. But fundamentally,
we just wanted a model that would do
multi-word color expressions of the sort that are in the dataset.
Just kind of thinking again about what was mentioned earlier,
it'll be really interesting to see if, like, you took
the NLI images along with their captions and fed them through something like this,
if you'd get a contextualized
weight vector, it's kind of [inaudible]
I like it.
It's there for you: with SNLI, you've got images, you've got captions.
Yeah. I'm surprised more hasn't been done with it.
Here's a small twist that brings in what might be a more interesting corpus.
So again, with Will and some people here at Stanford, Robert Hawkins and Noah
and me, we collected this colors in context corpus.
So this one is different in the sense that instead of
describing color patches in isolation,
people were given three color patches and
told which one was their target to describe to a listener.
And the idea was that they succeeded to the extent that
the listener could figure out from the same context which one was their target.
So here you've got some, you know,
a lot of simple color expressions in these very easy
to distinguish contexts, blue, purple, green.
The per- person just said blue.
But in this case here where there are two blues,
they said the darker blue one. And I love that, right?
Because that's grounding, that comparative the darker blue one is
making implicit reference to the fact that there are two very similar color patches,
and it's picking the darker one.
And this is even more explicit.
Teal not the two that are more green.
So these are all greenish colors and this person
was like very explicit about picking out that one.
Dull pink not the super bright one.
Not any of the regular greens very rich context reference.
And then I love these two. So purple and blue.
This is exactly the same color patch,
and you can just see that depending on
the context they were in people described that color differently.
It's a classic thing about colors that even your perception of them
can be somewhat influenced by the competitors in the space,
and here you see that reflected linguistically.
And you can have a simple speaker that does this.
I'm gonna return to this model, but even this baseline is interesting,
where you just have it process three color representations,
embed them in the same space,
designate that, say, the one closest to the end is the target,
and then have it describe that one.
And what you get out of this model is color descriptions that
are implicitly context sensitive in the way that I just showed you.
We did that same trick of reminding the, um,
thing what the target referent was through its training procedure.
And here's some related ideas.
So you could think of what I just showed you with
color describing as a special case of image captioning,
which is a task I'm guessing you guys have heard about.
It was revolutionized a few years ago in part by researchers here at Stanford,
um, by, you know,
having these captions be generated by a neural encoder decoder.
Um, for those models,
the encoder part is typically much more involved than the one I just showed you, right?
All we had to do is embed the color in some interesting space.
But here we might have to embed an arbitrarily complicated image.
But the basic structure is the same, you know,
do some processing to your input object,
and then produce some language.
And then anticipating the colors in contexts thing.
There were a couple of papers from these groups at Google that
explored image captioning as a kind of contextual task.
So the intuition there is,
if you give me three pictures of buses,
it's okay if I say bus,
bus, bus for my captions.
But what you really might want there is captions that distinguish the three buses.
So what you want is a model that has some incentive to say like,
school bus, tourist bus, double decker bus,
or something that is like that because of the context and the confusability,
this pressure to be informative in that context will lead to more interesting captions.
And then finally, just kind of more open-ended here, visual question answering.
That's a task where you're given an image and a question text,
and together you have to produce what I
would call a grounded answer to that question in the image.
Those are my speakers. Questions?
[inaudible] for when you have the colors up?
Um, is there any literature on, like, given an utterance and three contexts, say,
choosing the context in which that
utterance would have been needed to distinguish between colors?
So not just the target but the entire context.
Right. Here I was thinking of
this because I read the teal, not the two that are more green,
and then I was thinking, okay,
which one of these would uniquely be the one
that would be necessary.
I don't know of work on that,
but it's a very interesting thing to think about.
It's another perspective like you as a human,
if I say the darker blue one,
you learn that there are two blueish things, right?
And you're saying that for this one you learned a whole lot
about what the whole context needed to be like.
Just because of presuppositions that you have
about the things speakers will do in those contexts.
I love that. Yeah. And potentially
you could use this corpus to get some traction on that problem.
Has there ever been any work done on image generation from a caption?
Yes, I have a few references; that would fall under my listeners.
So I have a few references for that too, right?
Image generation, yes.
Yeah, that would be like a listener.
So for the listeners again,
you could have the same dataset.
People have done this: same task formulation, we'll have this colors in context thing.
Um, except now we're given the description,
like light-blue, and this could be a standard sort of RNN here.
And this will be the crucial hidden state.
You might do some transformations on
that hidden state to kinda move yourself into color space.
And then what you're trying to do is you're given this context and it says,
one, two and target.
So imagine this one is the target here.
That's what you want your agent to pick.
So you could do some kind of transformation; in work that I've done,
we've done a Fourier transformation to put these into a more interesting space.
And then use this hidden representation or some statistics derived
from it to create a score for each of these elements of your context.
And then those scores could be fed into a softmax classifier that finally decides
based on all of this stuff which of these things
was actually the target of the description.
So it's a little bit more involved because of
the structured nature of the space on the right,
how to be a listener here.
But it's entirely doable and again colors are cool
because like you could just make some simple assumptions about what these,
uh, representations should be like to construct all the scoring stuff.
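Here's a minimal sketch of such a listener, with the same caveat that the dimensions, the dot-product scoring, and the names are illustrative choices rather than the exact models from the papers:

```python
import torch
import torch.nn as nn

class ColorListener(nn.Module):
    """Minimal listener: encode the description, score each color in the context."""
    def __init__(self, vocab_size, color_dim=3, embed_dim=50, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_color_space = nn.Linear(hidden_dim, color_dim)  # move into color space

    def forward(self, word_ids, context_colors):
        # word_ids: (batch, seq_len); context_colors: (batch, 3, color_dim)
        _, h = self.rnn(self.embedding(word_ids))
        query = self.to_color_space(h.squeeze(0))                    # (batch, color_dim)
        scores = torch.bmm(context_colors, query.unsqueeze(-1)).squeeze(-1)
        return scores   # (batch, 3) logits; softmax/cross-entropy picks the target

listener = ColorListener(vocab_size=6)
words = torch.tensor([[2, 4]])           # e.g., "light blue" as word ids
context = torch.rand(1, 3, 3)            # three candidate colors
loss = nn.CrossEntropyLoss()(listener(words, context), torch.tensor([2]))
```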
And here are a few other ideas, keying into your question.
So first, just for a framework here.
Even a simple classifier is, in my terms, a kind of listener, right?
You consume some language like a document or
NLI entailment pair or a sentence about a movie.
And you make an inference into a structured space.
And that's the kind of simplest version of this.
And then the other extreme I would say are the models that Bill is going to talk
about on Wednesday like semantic parsers where you consume some language,
construct maybe a very sophisticated latent representation of that language,
and then make a prediction into a highly structured output space.
And then more keyed into what I just showed you, for example, is scene generation.
So the task of mapping from language to structured representation of visual scenes.
And here are some papers that do this.
A common theme that runs through them is,
what you actually do is map from language into like,
uh, an attribute value structure of an image which then some other process produces.
Because you kind of need some of
these aspects of the highly structured output space to get traction.
But even that could be really interesting like in
this work with Angel Chang and colleagues we, um,
went from language to attribute value descriptions that could be fed into Pat Hanrahan's,
um, 3D visual scene generator.
And then it could produce these really rich visual scenes.
And then, this is another important paper.
This is the Flickr30k paper from Julia Hockenmaier and her group.
And their idea was to learn visual denotations for linguistic expressions.
So one version of this, uh,
of this problem. Question? Yeah?
So it seems like maybe the way that you've
presented this for the color
task is that the speakers and
the listeners are sort of disconnected: we have, you know,
a dataset of, like, [inaudible], and we're training a speaker to kind of match those.
And we also have a dataset of utterances and colors, and we're training listeners
to go from the language to the colors.
But I feel like there's also some research, right?
Um, just, like, end to end, where a speaker is generating
a message that goes directly to
a listener, which then needs to try and recover the initial thing.
Yes.
Does that fall somewhere, like-
I think that's next, right?
That would be like a chatbot?
I think I understand correctly;
if I don't, if the model you have in mind is different from the one I showed you,
just ping me again.
Um, but I think you're describing a chatbot,
and my pitch here will be that our chatbots should be grounded.
So here's a standard kind of neural chatbot setup,
where you have some agents say, "What's up?"
And it produces like a hidden representation.
Then you have this bridge into another RNN, the decoder.
And this thing is tasked with producing utterances,
and it does it in exactly the way that my basic speakers did in the color space, right?
With the- the teacher training and so forth.
So what's up, not much.
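Schematically, and only as a hedged sketch with assumed names (no attention, no real vocabulary handling), that setup is something like:

```python
# Minimal encoder-decoder chatbot: encode "What's up?", then decode "Not much"
# with teacher forcing, i.e. the gold previous token is fed at each step.
import torch
import torch.nn as nn

class Seq2SeqChatbot(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.emb(src_ids))          # the "bridge" state
        dec_states, _ = self.decoder(self.emb(tgt_ids), h)
        return self.out(dec_states)                      # per-step vocab logits
```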
Uh, there's lots of work on this.
A lot of it came out of Stanford.
I think a lot of progress has been made on this problem.
But I do think that you're kind of fundamentally stuck,
if what your model is doing is just reading in lots of dialogue text,
and trying to mimic what is in that text.
I think the lack of grounding is always going to limit
the ability of these chatbots to do interesting things.
Um, one person in particular made a lot of progress here, like,
getting them to be consistent in their utterances,
and kind of getting them to produce
interesting dialogue so that they don't degenerate into just saying,
"I don't know," or "I hate you," which is
a common strategy for these things to fall into.
And actually, one of the ways that he overcame some of
these limitations was with what I would call grounding.
So, for example, to enforce
consistency, so that they wouldn't in one dialogue turn say that they had children,
and then in the next say they had no children.
Um, he actually created like user embeddings for the different agents.
And that was enough of an anchor to enforce a lot of the consistency that he was after.
But what I'm gonna describe to you here is an even richer form of grounding.
This will be like task oriented chatbot.
And this is grounded in a
really wonderful couple of papers that came from the Facebook AI group,
uh, the first paper is called Deal or No Deal,
it has something that comes after that, um,
they released a dataset of about 6,000 dialogues that
are grounded in about 2,000 unique scenes,
uh, this is the interface.
The gist of this is that you have
two agents that are trying to negotiate on a trade of some objects.
These are the objects in this scene.
And for this agent,
they've been valued differently.
So the book has a value of 8,
the hats have a value of 1,
and the balls have a value of 0. So that's the first thing.
And then each agent is given a subset of them.
So this agent has one book,
one hat and zero balls.
And the agent that they're playing with has a similar,
uh, or may- you know,
a different value function and a different set of the objects.
And the idea is that they want to maximize their reward via a negotiation.
And it was nicely set up because in every single scenario,
there's a little bit of tension.
So it's not the case that both of them can walk away completely happy.
There has to be some compromise,
and that's what leads to the kind of interesting negotiation setup.
Here's what the data actually look like.
It's kind of confusingly encoded for a human,
but may be easy for your neural model.
So this is the representation for- there are
two perspectives: you and them.
This is the you perspective.
And the string of digits up here is saying that you have 1 book worth 0,
4 hats worth 2 and 1 ball worth 2.
And then you get this little dialogue, and then the outcome.
So in the end after this trade,
you have 0 of item 1,
or item 0, sorry.
4 of item 1, and 0 of item 2.
And you get a little boundary symbol.
And this encodes your reward for this negotiation,
and whether or not you agree to it,
and then finally down here that's
a representation of the goal function for the other agent.
These dialogues in the data set that they released are all paired like this.
So this was you perspective and here's them.
It's the same dialogue, but different outcomes,
different goal function, and of course this was the
you one that I showed you before in orange.
It's just been flipped.
So you get both perspectives.
And the reason for that is, because of these different goal functions,
the rewards are different,
and even the agreement can be different.
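Just to illustrate, not as the literal file format, you could think of the "you" record as something like this, with the field names being mine:

```python
# Illustrative rendering of the "you" perspective described above; the real
# release encodes this as a flat string of digits and special tokens.
you_perspective = {
    "goal":         [(1, 0), (4, 2), (1, 2)],  # (count, value): book, hat, ball
    "dialogue":     ["..."],                    # the turns, from "your" side
    "final_items":  [0, 4, 0],                  # counts held after the trade
    "agreed":       True,
    "partner_goal": "...",                      # the other agent's value function
}
# The paired "them" record is the same dialogue with roles flipped, so the goal,
# the outcome, and hence the reward can all differ.
```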
Does that make sense? And I think I can clarify about the task.
I think it's kinda masterful and it's,
it's got some subtle complexity that
generated really interesting social dynamics I would say.
Here's their model. Oh go ahead.
[inaudible] the other side?
Because I think they started off with a book and they got a book?
You are much faster at this than I am.
I'll double-check. [LAUGHTER].
[inaudible] Got it. Okay.
It's all I can do to remember the ordering of these things up here.
That's why I put these reminders to myself.
[LAUGHTER] Here's what their model looks like.
There are- and obviously there's a huge amount of information,
but I think at a high level,
you can kind of get the gist of what they're doing.
So it's a bunch of RNNs,
they happen to use these GRUs.
So Gated Recurrent Units,
it could be LSTMs.
Um, they have a goal encoder that just encodes the goal,
in its own embedding space and with its own hidden representations.
The goal here is like 1, 0, 4, 2, 1, 2, you know,
that's the goal sequence I showed you before:
count, value, count, value.
And then this is HG.
This'll be an important state,
because you'll see it in other places in the model.
But just imagine that the agent has consumed that goal,
um, representation and produced HG.
Then you have a dialogue encoder.
And what that thing is doing is just being a kind of chatbot,
so consuming and producing language,
and as it does that,
HG gets appended to all of these tokens here.
Again, I think that's just like the color thing.
It's a kind of a reminder about what the goal was in this diffuse neural sense.
So it works like a chatbot there.
And then finally for the output encoder,
this is a kind of- a thorny nest of these attention mechanisms.
I think I've captured the gist of it.
So for the output encoder here,
you notice that you have h_1, x_1, h_2, x_2.
That's the conjunction of these representations down here.
They produce their own bidirectional RNN,
which is fed through these attention mechanisms.
HG appears again, concatenated with the summary representation from all this attention business.
And then that representation is the basis for a bunch of different classifiers,
item 0, 1, and 2,
predicting the number of objects that were obtained after this negotiation happened.
So there's a lot going on here,
but the high level summary would be like,
a neural encoding of your goal,
be a dialogue agent,
and then a neural classifier for the different output states,
for the different items. Makes sense?
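For what it's worth, here is a compressed, hedged sketch of those three pieces; the dimensions, the single attention layer, and all the names are my simplifications, not the paper's exact architecture:

```python
# Goal encoder -> h_g; goal-conditioned dialogue GRU; output selection over an
# attention summary, with one classifier per item type.
import torch
import torch.nn as nn

class NegotiationModel(nn.Module):
    def __init__(self, vocab_size, n_item_types=3, max_count=10,
                 emb_dim=64, goal_dim=64, hidden_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.goal_emb = nn.Embedding(32, emb_dim)        # small count/value symbols
        self.goal_rnn = nn.GRU(emb_dim, goal_dim, batch_first=True)
        self.dialogue_rnn = nn.GRU(emb_dim + goal_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)
        self.output_rnn = nn.GRU(hidden_dim + emb_dim, hidden_dim,
                                 batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.item_heads = nn.ModuleList(
            [nn.Linear(2 * hidden_dim + goal_dim, max_count + 1)
             for _ in range(n_item_types)])

    def forward(self, goal_ids, token_ids):
        # 1. Encode the goal sequence (count, value, count, value, ...) into h_g.
        _, h_g = self.goal_rnn(self.goal_emb(goal_ids))
        h_g = h_g.squeeze(0)
        # 2. Dialogue GRU with h_g appended to every token embedding (the chatbot part).
        toks = self.tok_emb(token_ids)
        goal_rep = h_g.unsqueeze(1).expand(-1, toks.size(1), -1)
        dial_states, _ = self.dialogue_rnn(torch.cat([toks, goal_rep], dim=-1))
        word_logits = self.word_out(dial_states)
        # 3. Output selection: bidirectional RNN over (h_t, x_t) pairs, attention
        #    summary, concatenate with h_g, classify counts for items 0, 1, 2.
        out_states, _ = self.output_rnn(torch.cat([dial_states, toks], dim=-1))
        weights = self.attn(out_states).softmax(dim=1)
        summary = (weights * out_states).sum(dim=1)
        features = torch.cat([summary, h_g], dim=-1)
        item_logits = [head(features) for head in self.item_heads]
        return word_logits, item_logits
```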
They do an interesting thing, which I'll return to.
But just at the high level.
So the dataset is kind of small.
So what they do is they train two agents and then fix one of them.
And then another agent can interact with that first
fixed one and start updating its parameters. Here I've walked through that.
So agent A, that's the one we're training,
it reads in the goal,
it writes out some dialog, and then agent B,
this fixed one responds,
because it hit this end token here.
So the agent B produces some stuff.
And that could continue,
these agents could go back and forth.
And then finally agent A does this classification thing and gets a reward.
And now you could do a reinforcement learning thing of
feeding that reward back through the parameters for A.
And in that way,
even without any more human data potentially,
agent A could get updated.
And maybe get better at this negotiation task.
And here I mean, uh, in principle,
agent B could be a human,
uh, helping to train the system.
But I think they just use their pre-trained one.
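A hedged sketch of that loop, with an assumed agent interface and a plain REINFORCE update on agent A only (nothing here is the paper's actual API):

```python
import torch

def score(items, goal):
    # Reward for agent A: sum of (count obtained * value) under A's goal.
    return sum(count * value for count, (_, value) in zip(items, goal))

def selfplay_episode(agent_a, agent_b_fixed, goal_a, goal_b, optimizer,
                     max_turns=10):
    # agent_a.generate / predict_outcome return (output, log-prob of the sample);
    # agent_b_fixed.respond is frozen. These method names are assumptions.
    log_probs, dialogue = [], []
    for _ in range(max_turns):
        utterance_a, lp = agent_a.generate(goal_a, dialogue)
        log_probs.append(lp)
        dialogue.append(("A", utterance_a))
        if utterance_a.endswith("<end>"):      # A hit the end token
            break
        dialogue.append(("B", agent_b_fixed.respond(goal_b, dialogue)))
    items, lp = agent_a.predict_outcome(goal_a, dialogue)   # the classifier step
    log_probs.append(lp)
    reward = score(items, goal_a)
    loss = -reward * torch.stack(log_probs).sum()           # REINFORCE on A only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```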
Another really nice insight from the paper is that
even this agent here is not especially well-grounded.
Certainly this one is just kind of
operating in the world of language and representations.
But it's not making use of the reward.
Here you see the reward come in.
And then what they did is this thing that they call dialogue rollouts,
so in this case,
these agents make predictions about how the negotiation is gonna go.
Get a reward signal,
and then that's fed back through,
and it helps them make local decisions.
In the way of like,
uh, Markov decision processes.
So like here I am in my state,
make predictions about the future, get a reward,
do that for a bunch of different times and pick the, the state transition,
that is like the dialogue move,
that will give me the highest expected reward from this rollout.
And that's really nice, because then you're making very rich use of
the rewards for these local negotiation decisions.
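As a hedged sketch, with the simulation and reward helpers passed in as assumed callables rather than the paper's actual machinery, the rollout idea looks like this:

```python
def choose_by_rollout(agent, partner_model, goal, dialogue,
                      simulate_to_end, reward_for,
                      n_candidates=10, n_rollouts=5):
    # For each candidate next utterance, simulate how the negotiation might
    # finish, average the resulting rewards, and keep the best move.
    best_move, best_value = None, float("-inf")
    for _ in range(n_candidates):
        candidate = agent.sample_utterance(goal, dialogue)
        total = 0.0
        for _ in range(n_rollouts):
            finished = simulate_to_end(agent, partner_model, goal,
                                       dialogue + [("A", candidate)])
            total += reward_for(finished, goal)
        if total / n_rollouts > best_value:
            best_move, best_value = candidate, total / n_rollouts
    return best_move
```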
That's my fast high level summary.
I do encourage you to look at the paper for more details.
There's just one more thing that's fun that I wanted to say about this.
So there was an amusing media narrative.
You guys might have heard about this paper without even
realizing so far that you've heard about it.
So this is the relevant passage.
"During reinforcement learning, an agent A attempts to
improve its parameters from conversations with another agent B.
While the other agent B could be a human,
in our experiments we used our fixed supervised model that was trained to imitate humans.
The second model is fixed as we found that updating the parameters of
both agents led to divergence from human language."
Do you guys remember this now?
Okay. So then it had a blog post,
which is also pretty,
you know, pretty modest about this.
"The second model is fixed,
because the researchers found that updating the parameters of
both agents led to divergence from human language,
as the agents developed their own language for negotiating."
Okay. So then Newsweek.
"The bots ran afoul of their Facebook overlords
when they started to make up their own language to do things faster.
Not unlike the way football players have shorthand names for
certain plays instead of taking the time in
the huddle to describe where everyone should run."
I don't know about the football analogy,
but at least the first part of this is pretty clear-sighted.
"It's not unusual for bots to make up a lingo that humans can't comprehend,
though it does stir worries that these things might gossip about us behind our backs.
Facebook altered the code to make the bot stick to plain English."
I think that's still pretty good as a report about what they did.
This next one is different. Tech Times.
"Facebook was forced to shut down one of
its artificial intelligence systems after researchers
discovered that it had started
communicating in a language they could not understand."
And it gives examples.
And the examples are like,
the bot saying, "I'm me me me me me me me me Ball."
That's the language that they had discovered on their own.
[LAUGHTER] Which, I mean,
it is kind of interesting to think that for
these agents that might be a meaningful negotiation move,
and maybe you're creeped out by that.
But I think this is kind of overblown.
It doesn't stop there.
The incident evokes images of the rise of Skynet in the iconic Terminator series.
Perhaps Tesla CEO Elon Musk is right about AI being the biggest risk we face.
I mean, you've seen the model.
I think it's a little bit overblown.
[LAUGHTER] Back down to Earth.
If you're interested in these problems,
here's a bunch of other datasets.
Uh, some of these are pretty hard, um,
but there are also instances of doing
task-oriented dialogue, and I think you can have a lot of fun exploring them.
Okay. Last phase of what I wanted to do.
I can at least give you a flavor for this.
This is what I've called reasoning about other minds.
Here's my little motivating example,
which will remind you of Grice,
if you've read Grice.
So they have these two agents here,
and just imagine that they're referring to these two cylinders.
So this agent is going to say,
"The blue one please," and it has in mind this one on the left, this agent.
The idea is that this agent is thinking,
my listener knows that I'm cooperative in the Gricean sense.
And what that really means is just that, um,
I'm trying to communicate in an open and honest way,
and my listener knows that.
And we kind of collectively know that both of us can leverage information
about the context and what we know about being human and so forth,
in order to figure out what we mean.
So they'll be able to work out,
that I mean the unmarked blue one.
The idea there is that since you just said blue and not something like baby blue,
that it must be the unmarked form of blue.
And then if you're the listener you think, "Wait a second.
The blue one, that's ambiguous."
But I was assuming that the speaker was cooperative.
That's this kind of collective social thing.
Uh, but if I assume that they would have picked a marked form like baby blue
if that were what they meant, then it must be that they mean the unmarked form.
Like, if they could have been more informative in that way with their utterance,
they would've been and they weren't,
so I'll default to assuming they mean the unmarked one.
And in that way, they get this kind of mind meld around this one.
And the idea is that this would be completely different in this context, the blue one.
This is the same color blue that I had over on the right here.
But now the blue one would just obviously refer to this one. No confusion.
So that's the kind of dynamic that we're trying to capture.
All these thought bubbles,
the idea is that if our systems could do this at some level,
then they would be more efficient in communication.
And here's a simple model for doing that.
It's called the Rational Speech Acts Model.
It was developed by people here at Stanford.
And it has a kind of hierarchy of three probabilistic agents.
So the literal listener here, observes a message.
This is like the listeners I showed you.
And we're assuming some language that they speak, Lex.
And on the basis of that message in that language,
makes a guess about what the world is like.
And they do that based on a prior over the world, P(w).
And also just this core lexical information,
which you could think of as kind of just you-
your knowledge of the language that you speak.
So that's the literal listener.
The pragmatic part about this is that we have a pragmatic speaker.
Now, speakers in this model observe the world and the language,
and then they produce a message,
just like the speakers that I showed you before.
And the idea is that this speaker does this based on the literal listener,
that's the purple here,
and some information about how costly different messages are.
How marked they are or something.
But the important thing here is that the speaker
is not just reasoning directly about the language,
this Lex, but rather about a listener- literal listener who reasons about the language.
So that's that one layer of indirection.
And already there, you get
a richer communication system than you would've
had if you were just reasoning about the language.
And then finally, this pragmatic listener,
here you have two layers of indirection.
So it's a listener,
it observes a message and makes a guess about the world,
but it does it by reasoning about the prior over worlds,
and the speaker who's reasoning about the literal listener.
This back and forth is kind of mirroring those thought bubbles,
where it was like, "I'm a listener,
and I imagine I'm a speaker.
Oh, and then I can work it out on the speaker side as well."
You can reorient this model however you like.
But the important thing here is that,
if you're a pragmatic listener or pragmatic speaker,
you don't reason just about the conventions of your language,
you reason about agents who are using those conventions.
And from that, you get additional meaning.
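In the standard formulation, with a rationality parameter alpha and message costs as the usual knobs (I'm adding those defaults here, they weren't spelled out above), the three agents are:

```latex
\begin{align*}
L_0(w \mid m) &\propto \mathrm{Lex}(m, w)\; P(w) \\
S_1(m \mid w) &\propto \exp\!\big(\alpha \log L_0(w \mid m) - \mathrm{cost}(m)\big) \\
L_1(w \mid m) &\propto S_1(m \mid w)\; P(w)
\end{align*}
```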
Here's a simple example.
So I have just a very simple case here.
We have two messages beard and glasses.
This is David Lewis with his magnificent beard.
He was a philosopher.
And this is Paul Grice,
he doesn't have a magnificent beard at this time in his life,
but they both have glasses.
So true, true, true, false.
At this level, if you, uh,
were a listener and you heard beard or glasses,
beard would be unambiguous for you right away,
but glasses is true of both these entities.
So if you heard it, you would have to make just a random guess
about which of these two entities the speaker was referring to,
if you were a kind of mechanical semantic, uh, listener.
If I perform that speaker calculation,
that speaker agent that I just showed you,
already now, surprisingly I have an unambiguous system.
If I want to refer to Lewis,
I'll probably say beard.
And that follows from the fact that if I want to refer to Paul Grice down here,
my only option in this constrained signaling system is glasses.
And that creates this kind of separation.
And then when we go one layer further, now beard,
that unambiguously refers to Lewis,
because Grice doesn't have a beard.
But notice that that 50-50 is now 25-75.
A lot of bias in favor of hearing glasses and thinking it's Grice.
And that is just like that, uh,
dynamic I showed you before where you kind of were literally just
mathematizing the idea that if I wanted to refer to Lewis,
I could've done it with beard,
and since I didn't, you must mean glasses.
You must mean Grice. And the model has more or less
just turned that into a little probability Bayesian model of that reasoning.
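You can reproduce those numbers in a few lines, assuming a uniform prior over the two referents, alpha = 1, and zero message costs, which are the defaults I'm taking for granted here:

```python
# Rows are messages (beard, glasses); columns are referents (Lewis, Grice).
import numpy as np

lex = np.array([[1.0, 0.0],
                [1.0, 1.0]])

def normalize_rows(m):
    return m / m.sum(axis=1, keepdims=True)

L0 = normalize_rows(lex)        # literal listener: beard -> Lewis; glasses -> 50/50
S1 = normalize_rows(L0.T).T     # speaker: normalize over messages per referent
L1 = normalize_rows(S1)         # pragmatic listener

print(L0)   # [[1.   0.  ], [0.5  0.5 ]]
print(S1)   # [[0.667 0.  ], [0.333 1. ]]
print(L1)   # [[1.   0.  ], [0.25 0.75]]  <- the 25-75 split for "glasses"
```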
That's a wonderful model. It's been incredibly successful in linguistics and psychology.
Lots of good experimental evidence that people are doing something like this calculation.
I think a lot of problems in pragmatics were solved quite neatly by this model.
But if you wanna think like a computer scientist, it's pretty limited.
Because in all that work,
you have a hand specified lexicon,
and then when you perform that speaker calculation,
if you take the model literally,
you have to reason about every possible alternative message they could have produced,
which for a language like English,
is not even a finite number of things, presumably.
And then it's a very high bias model,
because I hand specify this, and there's no learning.
I don't have many chances to learn that.
For example, people love to say beard for David Lewis,
because his beard is so visually salient.
The colors in context
corpus is a setting in which we can think about this as a machine learning problem.
It's too large for a hand specified lexicon,
but maybe we can think about the speakers and listeners I showed you before,
and whether they could solve this problem.
So here's how this might go.
The literal neural speaker is the one I showed you before.
Let's call that S_0.
It just produces the context,
sorry, consumes the context,
and produces a description.
Presumably, of the target here.
The neural literal listener is the one that I showed you before.
It consumes language, and together with the context,
makes a guess about what the referent was in that context.
Those are the literal agents,
the ones that we would have hand specified in the previous mold.
Now we just have these kind of black box neural models doing this work for us.
And then in the simplest version of this model,
you could just apply that RSA calculation on top of those two agents.
So here's the speaker.
It's gonna depend on that literal listener,
which is the neural listener I showed you.
And this is kind of cool.
Instead of having to worry about all of
the conceivable utterances that are alternatives to the one that was chosen,
you could just sample from that speaker,
um, and use that as an approximation of this denominator here.
Once you've done that, your problem is solved,
the pragmatic listener is just an application of
the RSA calculation on top of this S1 here.
So in that way,
you've got these pre-trained models,
they're done, you apply the RSA calculation.
And presumably, those agents can do better at this task,
because they're effectively reasoning about each other.
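A hedged sketch of that sample-and-rescore move, with the S_0 and L_0 interfaces (sample and log_prob) being assumed names rather than anything from released code:

```python
# Approximate the pragmatic speaker by sampling candidates from the neural
# literal speaker S_0 and reranking them with the neural literal listener L_0.
import math

def pragmatic_speaker(s0, l0, context, target_index, n_samples=20, alpha=1.0):
    candidates = [s0.sample(context, target_index) for _ in range(n_samples)]
    scores = []
    for utt in candidates:
        # How strongly would the literal listener pick the right referent?
        listener_lp = l0.log_prob(target_index, utt, context)
        speaker_lp = s0.log_prob(utt, context, target_index)   # S_0 as a prior
        scores.append(alpha * listener_lp + speaker_lp)
    z = math.log(sum(math.exp(s) for s in scores))              # the "denominator"
    probs = [math.exp(s - z) for s in scores]
    return max(zip(candidates, probs), key=lambda pair: pair[1])
```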
And I'm running out of time here,
but I wanted to mention also that another way of solving
this problem here is to just change the space of atomic messages,
and that's what my student did.
So he did what we call pragmatic image captioning,
which is like that work I mentioned from the Google groups before.
Um, and here he applied RSA to a standard image captioning system,
and he actually, instead of sampling from S_0,
applies the RSA reasoning step at the level of individual characters,
and shows that that leads to better and more informative captions
in these contexts than alternative simpler models.
And that's like a wonderful instantiation of this idea,
this intuition that for informative image captions,
I should be thinking about the context I'm in,
and trying to produce a caption that is distinguishing my target from the rest.
And here's a bunch of other work.
Some of it with different kinds of corpora associated with it,
some of it from robotics,
some of it from semantic parsing,
lots of datasets, some newer developments down here that you might wanna pursue.
I think this is a very exciting space.
And then here are a few other corpora.
TUNA is a classic one,
that's like those philosophers I showed you.
I- don't ask me why it's called TUNA,
I can never remember.
Uh, SCONE is a newer one from Stanford.
And then Robert Hawkins actually released a crowdsourcing engine
that would allow you to collect your own reference games if you wanted to.
For this final step here,
I just wanted to gesture at a few ideas.
I know we're out of time, so I won't go through them in detail.
But like here's just,
again the simplest case of this sarcasm detection.
There's a wonderful corpus for this called SARC.
Sarcasm is a classic case of a problem
that you will not solve by looking at text alone.
It is not a fact about, uh,
a statement like "Great idea"
that it is sarcastic or not.
It is sarcastic when used by certain people in certain contexts.
And this SARC corpus is derived from Reddit.
There's an opportunity there because you could learn user embeddings, forum embeddings,
immediate dialogue context embeddings,
and show that all of those things are influencing the decision
about whether an utterance like great idea is sarcastic or not.
And in this deep learning mold, that's easy.
Just add more representations into your model and see if you can learn them effectively.
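For instance, here is a minimal sketch of that "just add more representations" recipe, with all names and dimensions illustrative rather than taken from the SARC paper:

```python
# Concatenate learned user, forum, and dialogue-context embeddings with the
# comment encoding, then classify sarcastic vs. not.
import torch
import torch.nn as nn

class SarcasmClassifier(nn.Module):
    def __init__(self, vocab_size, n_users, n_forums,
                 emb_dim=64, hidden_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.forum_emb = nn.Embedding(n_forums, emb_dim)
        self.comment_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.context_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim + 2 * emb_dim, 2)

    def forward(self, comment_ids, context_ids, user_id, forum_id):
        _, h_comment = self.comment_rnn(self.tok_emb(comment_ids))
        _, h_context = self.context_rnn(self.tok_emb(context_ids))
        features = torch.cat([h_comment.squeeze(0), h_context.squeeze(0),
                              self.user_emb(user_id), self.forum_emb(forum_id)],
                             dim=-1)
        return self.classifier(features)   # logits: sarcastic vs. not
```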
And then I pointed out a few other cases.
This is an example where you could do something similar,
but in the context of social graphs,
which would be another kind of grounding,
kind of keying into your intuition earlier about knowledge graphs.
PLOW, this is wonderful work from James Allen
where the context that you're in is a webpage,
which is very highly structured.
Uh, it's just another way to think about how you can get beyond
the limitations of language by actually having a rich representation of the context.
Here it's more symbolic.
But there are lots of webpages out there.
So there's no shortage of these structured contexts.
And then here's some work that I did with Adam Vogel,
that is more in the mold of like, um,
old-style decision process models.
And here I just wanted to point out this is- this is just wonderful.
So just give me one minute.
So we had these agents play this collaborative game,
where the- their goal, it's kind of hard to see,
but their goal was to just be co-located with this card here.
They succeeded if they both found it in some sense.
And they could communicate,
and we didn't hard-code anything about their incentives to actually communicate.
We just said, "You have this ability to,
uh, understand language if it comes in,
and to produce it if you wish," where speaking was an action,
just like moving around on the board was, essentially.
And these agents get trained over a long period,
after just a few hours of policy exploration,
which is like training for these models.
They would do funny things like,
this agent found the card,
but what it said was top,
and that sent the other agent off toward the top.
You can see it has a high probability for thinking that the card is at the top.
It's a very bad state for them to be in.
They learn pretty quickly from this kind of grounding,
together with their reward function that they should not use misleading language.
So after a week of policy exploration,
now they produce true utterances.
And at about that same time,
they realize that there's a reason for them to produce utterances.
Because if they do,
they can quickly influence the actions of
the other agent and they solve the problem quickly.
And since these agents are set up where every moment for them is misery,
all they wanna do is find this card so that they're freed from this misery.
Um, they pretty quickly learn that they should produce utterances,
and actually true utterances.
And that's a kind of Gricean thing that we claim was emerging from their grounding,
in this particular goal,
in this particular environment.
And I thought that was very exciting.
The problem with these models is that they're really intractable.
And I think they're just waiting for, um,
you know, the day when they get replaced by neural approximations,
which might be less crisp in terms of what they learn for policies,
but more able to deal with changing environments,
and complicated environments, and so forth.
Uh, here I think that kind of keys me up for these next frontiers here.
So what I expect is much deeper integration with devices and the environment,
more sophisticated reasoning about goals,
better tracking of dialogue history,
and as I said just now,
approximate state representations to address these scalability issues.
I think progress here is gonna lead us to agents that are truly grounded,
and that will lead us to agents that are truly
able to use language in a sophisticated way.
Okay, thanks.
