HOST: Thanks,
everyone, for coming.
We are very honored to
have Kevin Murphy here
as our speaker.
I've known Kevin for
a very long time.
I was just reflecting
back on when I met you.
Do you remember?
So I was in grad
school, and Kevin
was doing a summer
internship, I think,
at the DEC lab, or something.
Anyone heard of
that company, DEC?
They made a search engine called AltaVista, which briefly survived until it was overtaken by another company that you might have heard of.
[LAUGHTER]
But Kevin was doing a
summer internship then.
And I remember going
to dinner with you
at Daddy-O's, and sitting
out in the summer.
And I-- you know, instantly--
I mean, though I was in
brain and cognitive science,
Kevin was in computer science
and AI, I felt a connection.
And again, I think you'll see
this is computer science work.
This is-- he's going
to talk about machine
learning, computer
vision, and related things
in natural language.
But Kevin is a fitting
person to have here
for the Center for Brains,
Minds, and Machines
because he's always
been interested,
I think, in getting computers
to do the things that humans
do and is one of the most
interesting and best people
at doing that.
He has many positions he's
occupied since that summer.
He was-- he finished
his PhD at Berkeley.
He was on the faculty
at UBC for a while,
where he started working on
a machine learning textbook
that has become one of the--
I don't know if we want to call
standard-- it's certainly one
of the best books in the field.
It received the DeGroot Prize, which is given to books in statistical science, which is very impressive.
And I think it's one of
the books that best--
I'd say it's the book
that, to me, makes
the most interesting connections
between the whole broad swath
of machine learning and a
range of different ideas
and statistics, Bayesian,
otherwise, and so on.
He moved to Google
a few years ago
and has been leading the
group there in the-- well,
in exactly the area that you're
going to hear more about.
Oh, and I guess he recently
became editor-in-chief,
co-editor-in-chief of the
Journal of Machine Learning
Research.
That's super impressive.
[LAUGHTER]
So, anyway-- so there's
really nothing more
that needs to be said.
We're just extremely honored
and pleased to have you here.
So Kevin, take it away.
KEVIN MURPHY: Great.
Thank you.
[APPLAUSE]
That was a very
flattering introduction.
I hope I can live up to it.
I'm very excited to be here.
It's always fun to come to MIT.
There's so many interesting
things going on.
HOST: Yeah, you were
a postdoc at MIT.
KEVIN MURPHY: I
was postdoc, yes.
So I was a postdoc
here, 2002 to 2004,
with Bill Freeman and Leslie
Kaelbling before I went to UBC.
So yeah, I remember MIT fondly,
and I always enjoy coming here.
So like Josh said, I-- so I've
been at Google five years,
actually.
So about three years ago, I
went to the head of research,
who's a guy called John
Giannandrea and I said,
you know--
so this is how--
I'll tell you a little story.
This is how Google Research
is set up currently.
There are literally
three towers.
There's the NLP tower,
which is sort of
run by Fernando Pereira.
There's the computer vision
tower that's run by Jay Yagnik,
that I'm in.
And in the middle is the
machine learning tower
where Google Brain
is, run by Jeff Dean.
And everyone talks to the
machine learning tower.
Everyone-- you
know, deep learning
is used for everything.
But there wasn't
so much crosstalk
between the vision guys
and the language guys.
And it feels like these
should be interacting.
I don't think we need language
to build intelligent machines
because animals are
intelligent, and they do things.
But if we want to build devices
that interact with humans,
then we do need language.
But we don't want it
to just be language
about the web and
abstract things.
We also want to talk about the
physical environment and ask,
what is he wearing, and why--
oh, if he put his coat
on, is it raining outside.
And those kinds of
grounded things.
So that's a space
I'm interested in.
So I pitched this idea to
Jay G. about three years ago.
And I said, I think we should be
doing something in this space.
I think this is
interesting, and I
think it's going to
be useful to Google.
And you know, back then,
Google Home didn't exist.
There were no devices on
the market that needed it.
But he believed me.
So he said, OK, go ahead.
Build a team, do something,
or at least start
to try to do something.
So I spent a long time
trying to think of a name.
And I settled on VALE.
It's a pretty broad
title for my team--
Vision, Action, Language,
and Environment.
And I've been super fortunate
to grow the team to 10 people,
with an 11th person joining us.
So Carl Vondrick, who
some of you may know,
is starting next
month to join us.
And we've been
looking at, basically,
mostly computer vision and so,
specifically, object detection,
person detection,
and pose estimation,
and what I call
dense predictions,
so like mapping one
image to another image,
like 2 and 1/2-D vision,
depth estimation,
semantic segmentation,
colorization, surface normals,
optical flow, a little
bit, those kinds of things.
And I'll briefly summarize
some of our work in that space.
And then, I've always
been interested
in the language side.
We haven't done a
lot of work there,
but I have a few little projects
here and there that I've
done mostly with interns.
Jonathan Malmaud
was my intern.
And I won't talk
about his work today,
but that was one example.
So that work is kind of fun
and is a bit more cog-science-y, so I'm going to
talk about that, as well.
And then, very
recently, we've been
starting to dabble in
reinforcement learning.
And this is sort of
growing in popularity.
I deliberately
did not include RL
in the first version of my
book because, you know, even
only five years ago, it was
still just [INAUDIBLE] worlds
and wasn't being used.
And there's been a lot of
progress in that field,
as you may know.
So when-- I'm hoping--
I am working on v2 of the book.
It's a work in progress.
I don't know when
I'll be finished.
And there will certainly
be some coverage of RL.
And so you'll see RL creep
into this talk a little bit.
So it's pretty broad.
I have-- hang on.
Let me just do the
table of contents.
I have a whole slew of
things I'd like to cover.
I'm planning to go over the
perception side pretty quickly,
maybe in like 15, 20 minutes--
maybe, let's say, to 5:30.
And I-- you know,
you can interrupt me
if you want more
detail and stuff.
But then I'm going to
slow down a little bit
on the second half, which
is about sort of connecting
the vision to the language
side, rather than just
the pure vision
side, just partly
because it's more recent.
And I think it might be of
more interest to BCS folks.
Let me backtrack, though, and
just show the slides I skipped.
So that's the team.
I'll call out names
of people as and when
I get to specific projects.
So-- but this was sort of--
this is the landscape
we're in today, right?
Why try to connect
vision and language?
So I'm-- in this talk today,
I'm mostly focusing on static
images, single static
images, which is a simplification-- it's not really the real world that we're in.
But there are
artifacts of this form.
People take lots and
lots of photographs.
We'd like to interpret
these photographs.
And the language
comes in because we
want to annotate them so we
can retrieve them and describe
them, maybe to visually
impaired users.
If we could generate image
captions automatically,
that would be pretty
helpful to a lot of people.
And so I'll talk about some
methods for doing that.
Of course, you
know, in the future,
we'd like to look
at video analysis,
both in an offline setting--
you know, discover
interesting facts
about what happened
in your surveillance
camera or your
biological vision system,
watching mice in a
vivarium or something.
And then, even more interesting
is an interactive setting where
you've got streaming data.
And this is obviously the
closest to biological vision,
which is situated in real time.
And so this is sort
of the spectrum
that I'm hoping my team will move along.
And obviously, other people
are on different points
in the spectrum.
But today I'm going to focus
on this left side, which
is sort of the classical
computer vision, where we've
got web images, and we're going
to try to squeeze juice out
of them.
So I'm calling it deep
understanding of single images.
And here-- so person
detection and object detection
is an example.
Captioning is another.
So deep understanding--
well, aren't we done?
I mean, look at this figure.
Everyone knows the ImageNet challenge, right?
So the error rates have been
going down and down and down.
And look, we're
better than people.
Oh my god, we can just
quit and go home, right?
Well, no because
image classification
is not image
understanding, right?
I mean, obviously, everyone--
I assume most people in
this room agree with that.
But that statement is not
the default assumption
if you go to a computer
vision conference.
Certainly, some-- many of my
colleagues who I won't name
think classification is--
that's it.
If we can do better
at classification,
then we're making progress.
Well, that's-- there are many,
many other things we want
to get out of images, right?
So we would like to
do stuff and things,
as Ted Adelson says, right?
So stuff-- the dense things, the
surfaces, the surface normals,
the semantic segmentation into
categories like grass and road
and sky.
And then the things
that are countable,
that may be individualized,
that could, perhaps,
move-- people and dogs
and individual trees,
if we're interested in that.
So you know, you could move
from one category-- it could
be instantiated into
instances, if you
care about counting trees or detecting the diseased ones.
But if it's a group of
trees, it becomes stuff.
And then people, of
course, is a special case
because they're animate,
and they're, obviously,
very relevant for us and also
for other devices that we build
and artifacts that we build
because we want to make
stuff that's useful for people.
So we want to be able to
detect people and imagery
and estimate their
pose and maybe
use it to predict what
they're going to do,
understand what they're
doing, and so on.
So these are the kinds of
image interpretation tasks
that we'd like to
be able to solve.
So I'm going to go
over, very briefly,
like I said, some
of the techniques
that we've applied
to these problems.
And the basic hammer
that we're using
is the hammer that everyone
is using these days, which
is neural networks, in
particular convolutional neural networks, in particular deep convolutional neural networks, which are just
function approximators that
map from, in our case,
RGB images to some output
that we'll specify.
And we are going to assume--
so everyone says,
OK, why is the field
making so much progress--
because of deep nets?
Sure.
Deep nets are very,
very, very helpful,
but we need a way to make them.
So I think software
plays a big role.
The fact that there are
systems like TensorFlow that
makes it easy to create these
models and train them at scale
is a game changer because
these things are very data
hungry and compute hungry.
And then the other
thing that people
tend to not mention as
much is labeled data.
It's not just data,
but it's labeled
data, labeled by people.
This has driven
the field forward.
And it's a drug
we're addicted to,
to quote Jitendra Malik,
which is clearly biologically
implausible and is
also unsustainable,
from an engineering
point of view.
So I'll talk a little bit about
how to try to reduce our--
wean our addiction to this.
But in the first
half of the talk,
we're just going to inject the
morphine and enjoy the ride.
So we're going to
download the COCO data
set, which is a public data
set not created at Google.
It's not particularly big.
It's-- let me read
the statistics here--
so 125k images.
But the key thing
about COCO-- it
stands for Common
Objects in Context--
is that it's very
densely labeled with lots
of juicy kinds of annotations.
So we have instance masks.
We have bounding boxes.
We have person keypoints.
We have captions.
There's all kinds
of layers of stuff
that have been added on top of this.
So it's a very
useful data source.
That's why it's been moving the needle in the field.
So let's try to tackle
some of these tasks.
So maybe starting at the bottom
of the stack of these sort of--
it's not exactly
pre-semantic because I
have the word "semantic" in the
title, but it's dense, right?
So the input's an image,
and the output's going to be
an image-shaped thing,
where, in this case,
in the task of semantic
segmentation, the goal is--
assign a pixel to every--
sorry, a label to every pixel.
So we have a finite number
of categories, like 20 or 80,
depending on the training set.
And we have labeled pairs that--
this is somewhat expensive
to acquire this data.
But if you can get it--
and at this point,
you know, there
are several data
sets of this form--
now, it's just a standard
supervised learning problem.
And you know, the
questions to be answered
are really what's the sort
of form of this network
and then how do we
train it efficiently.
So the form-- you know, CNNs
for classification problems
have this sort of funnel shape.
They start with a big image,
and they squeeze it down,
and they predict a small
number of outputs--
20 or 1,000 labels or
something-- so they
have this bottleneck shape.
But in these dense
problems, the output's
just as big as the input.
So a very common architecture
is this hourglass, where you
shrink, and then you expand.
So you are doing
convolution initially,
and then you're doing
deconvolution to go back up.
So the problem with
that is that you're
losing a lot of information in
this bottleneck in the middle.
So you've thrown away
a lot of the signals.
So basically, there's a
whole rash of papers--
I've listed three here,
but there are many more--
where people try to recover
that lost information.
In a supervised
setting, it's not so bad
because you are given a
very high resolution input.
So you can do these skip
connections, basically,
that will sort of copy some
of the high resolution input
from the beginning all
the way to the end.
And then the network's adding
the sort of semantic layer
on top, which is typically
lower resolution, anyway.
And then the network learns
how to fuse those two.
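To make that concrete, here is a minimal tf.keras sketch of such an hourglass network with skip connections. This is an editor's illustration, not the actual DeepLab or Google code; the input shape, layer sizes, and NUM_CLASSES value are placeholder assumptions.

    import tensorflow as tf

    NUM_CLASSES = 21  # placeholder label-set size, e.g. a PASCAL-style vocabulary

    def build_hourglass(input_shape=(128, 128, 3)):
        inputs = tf.keras.Input(shape=input_shape)
        # Encoder: shrink spatial resolution while growing the channel count.
        e1 = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
        p1 = tf.keras.layers.MaxPooling2D()(e1)      # 1/2 resolution
        e2 = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
        p2 = tf.keras.layers.MaxPooling2D()(e2)      # 1/4 resolution: the bottleneck
        # Decoder: deconvolution (transposed convolution) back up to full resolution.
        d2 = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same",
                                             activation="relu")(p2)
        d2 = tf.keras.layers.Concatenate()([d2, e2])  # skip connection: copy high-res features forward
        d1 = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same",
                                             activation="relu")(d2)
        d1 = tf.keras.layers.Concatenate()([d1, e1])
        # Dense head: per-pixel class logits, trained with per-pixel cross entropy.
        logits = tf.keras.layers.Conv2D(NUM_CLASSES, 1, padding="same")(d1)
        return tf.keras.Model(inputs, logits)

    model = build_hourglass()
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

Training it against per-pixel label maps is then the standard supervised setup described above.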
So I'm not going to
go into these sort
of low level architectural
details of the networks
because these things
change rapidly.
And it's not clear what the key
principles are because there's
still lack of consensus.
But this sort of high-level
idea that you want to pass
information at multiple scales--
I mean, that's been
around for a while.
And that's driving a
lot of these things.
So they're often
called U networks
just because people
draw them like this.
I'm going to call them
conv-deconv networks.
But anyway, all of-- you can
use one of these architectures
that you like.
They have access to the
high resolution input.
But nevertheless, the
outputs that these models
are predicting tend
to be kind of blurry.
So our contribution to
this space was to say,
well, you know, there
are these methods
that people did five years
ago, all that time ago,
called graphical models.
And they have some
nice properties
for modeling correlation
between random variables
that the neural network doesn't
explicitly capture, right?
The outputs are predicted
independently per pixel,
conditional on the hidden
states of the neural network.
But nevertheless, there's
no explicit correlation.
Whereas you can model those correlations explicitly with conditional random fields.
So in particular,
there's a paper--
let's see if I can
get the reference--
from Krähenbühl and Koltun, 2011, and they
showed that you can actually
capture long range
correlations between pixels
if you make your
graph fully connected.
And you can still do
efficient inference
in such a model using
mean field algorithms.
But back in 2011, people were--
so basically, I'm not going
to go into the details,
but probably many of
you know, with a CRF,
you have to model the-- say what
the correlation structure is,
but also what the
local evidence is.
Like, locally, what do you think this pixel should be?
What category?
So you know, back then, they
were using random forests,
and so on.
It's sort of a
low-hanging fruit.
You just plug-in a
neural network instead,
and you'll get better results.
So we take one of these neural
nets that's trained end-to-end,
and then we're going to
feed that into the CRF,
and that's going to clean up
some of the high resolution
edge information that
was lost by the network.
So that's sort of the key idea.
It's called DeepLab.
So the two primary
authors are Liang-Chieh Chen,
who's at Google in LA,
and George Papandreou,
who's also at Google in
LA, who's on my team.
Iasonas Kokkinos is a
colleague of George's.
He's a faculty--
I'm not sure where now.
I think he's at maybe UCL--
and on Kevin-- sorry.
I'm reading my own name.
Who's this guy, Kevin Murphy?
He's just a free rider!
[AUDIENCE LAUGHTER]
Alan and I were playing an
advisory role on this one.
So there's sort
of two key things.
One is the CRF component I mentioned; the other is this idea of atrous convolution,
which is essentially
expanding the spatial support
of your filters in
an efficient way
to capture a long-range
correlation without blowing up
the computational costs.
And this is an old idea
from signal processing.
I learned this from George,
who has a signal processing
background.
Recently, Vladlen Koltun rebranded it as dilated convolution,
and George got mad, saying,
why invent a new name
for something that
already has a name?
This is French for "with
holes" because you're
putting holes in your filter,
but you're not actually
multiplying by 0.
That's the key trick.
It's very simple.
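As a rough illustration of that trick, the sketch below (placeholder shapes, tf.keras) compares an ordinary 3x3 convolution with the same 3x3 filter dilated by a factor of 2: the dilated filter covers a 5x5 region of the input, but still has only nine weights, so the spatial support grows without extra cost.

    import tensorflow as tf

    x = tf.random.normal([1, 32, 32, 8])  # dummy feature map

    dense_conv = tf.keras.layers.Conv2D(16, 3, padding="same")                    # 3x3 support
    atrous_conv = tf.keras.layers.Conv2D(16, 3, padding="same", dilation_rate=2)  # 5x5 support, still 9 taps

    print(dense_conv(x).shape, atrous_conv(x).shape)                # both (1, 32, 32, 16)
    print(dense_conv.count_params() == atrous_conv.count_params())  # True: same weights, same cost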
Anyway, this is the part
I'm more familiar with.
So in the CRF, if you just
use a nearest neighbor grid
structure, you don't
really get any juice
because the neural network, the conv net, is already capturing short-range local correlations by virtue of having convolutional filters.
So you need to use
these models that
can exploit long-range
connections to get any win.
So this is kind of
what it looks like.
You've got this network
predicting, per pixel,
the probability of
each of the categories,
like the softmax heatmap.
And it's somewhat blurry.
And then you feed it
into this network,
and you get a nice,
sharp output coming out.
And inference in this network
is an iterative process.
And I think I cut all these
slides just to save time,
but you can implement that
iterative process as an RNN if you want.
And it's actually implementing
the mean field equations.
And after a few iterations,
it will converge,
and you get nice wins.
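Here is a highly simplified NumPy/SciPy sketch of those mean-field iterations, using only a single Gaussian spatial kernel; the real DeepLab CRF also uses a bilateral, color-dependent kernel and learned weights, and the unary logits here are assumed to come from the conv net.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def mean_field(unary_logits, num_iters=5, sigma=3.0, pairwise_weight=1.0):
        """unary_logits: (H, W, L) per-pixel class scores from the conv net."""
        q = softmax(unary_logits)  # initialise with the unary softmax
        for _ in range(num_iters):
            # Message passing: smoothing each label map with a Gaussian kernel stands in
            # for summing k(f_i, f_j) * Q_j(l) over all other pixels j.
            messages = np.stack(
                [gaussian_filter(q[..., l], sigma=sigma) for l in range(q.shape[-1])],
                axis=-1)
            # Potts-style compatibility: agreeing with your neighbours raises a label's score.
            q = softmax(unary_logits + pairwise_weight * messages)
        return q

    # Toy usage: 3 classes on a 64x64 image with random unaries.
    labels = mean_field(np.random.randn(64, 64, 3)).argmax(axis=-1)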
So here's just some eye candy: an input image, predictions from the neural net. And then you stick it--
stick the CRF on top.
And you can see it
sharpens up the results.
So you know, it's
not qualitatively--
in some cases, it
is flipping labels,
so it can suppress
some false positives.
But primarily, it's
sharpening up the edges.
And that gets you some gains in
terms of the standard metrics.
So roughly speaking,
you're going from,
in the pre-CNN era--
so the same CRF, but with, say, a random forest or something. It's like 50%. CNNs come along--
you get 10% gains
just by using a neural net.
Everyone's happy, and then
you stick a CRF on top,
and maybe you get
another 5% or so.
And you know, the really
nice thing about this
is it's just a black box.
And you can just throw
different data sets at it,
and it will learn
different mappings.
So this is-- here, the
labels are different.
So it's labeling
parts of objects.
And now we can train
it on urban data.
This is the Cityscapes
data set from Daimler-Benz
and a group in Germany.
And you can see the
application, the relevance
to self-driving cars, which
is not something I work on,
but this is just the data set.
We tried it, and we
get good results.
Not state-of-the-art-- there
are people who have beaten us--
but pretty good.
And you can train it to detect parts of people.
And so this is a pretty
generically useful thing
to have.
Now, one thing we've noticed--
as you all know, the
trend in the field is--
make my networks
deeper and deeper.
And so you think
you're doing well,
and someone just adds another
10 layers, 100 layers,
and they beat you.
And you start crying like
the girl in this photograph.
So we had to do
something similar
when Liang-Chieh Chen wanted to
make a journal version
of our conference paper.
So by then, the field
wasn't staying still.
So we had to use a
better model underneath.
We had been using VGG, which
is a network from the Oxford
group, Zisserman's group
at Oxford, with 16 layers.
And then ResNet-101 came along, and it was much, much better.
So we just swapped out VGG
and replaced it with ResNet.
And then everything got
better, but the relative gain
from our CRF started to shrink.
So you know, if you stick a CRF on top of VGG, you gain about 2.5% by this metric, intersection over union.
And if you stick a CRF on top
of ResNet, you still get a gain
but it's now 1.3%.
And you can kind of
see the trend here.
So we didn't even bother
porting the CRF code
because it's a little
tricky to implement.
So we just-- it's just not worth
the engineering complexity.
But the upside of
having the simpler thing
is you can use this model, not
just with different data sets,
but it's easy to modify.
I think I cut these slides.
But you can have the same
thing predict not just
semantic segmentation,
but, say, depth per pixel
and maybe surface normals.
And then you can have
one model predict
all of these things
at the same time.
So we call it the master net.
And it's just different
output heads for your CNN.
And so you can--
it's pretty efficient and you
can make it run on the phone.
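To illustrate the multi-head idea, here is a minimal sketch; the particular heads (semantic labels, depth, surface normals), layer sizes, and losses are illustrative stand-ins, not the actual master net.

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(128, 128, 3))
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)  # shared backbone

    semantic = tf.keras.layers.Conv2D(21, 1, name="semantic_logits")(x)  # class logits per pixel
    depth = tf.keras.layers.Conv2D(1, 1, name="depth")(x)                # depth per pixel
    normals = tf.keras.layers.Conv2D(3, 1, name="normals")(x)            # surface normal per pixel

    model = tf.keras.Model(inputs, [semantic, depth, normals])
    model.compile(
        optimizer="adam",
        loss={"semantic_logits": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              "depth": "mse",
              "normals": "mse"})

Because the heads share the backbone, adding a new prediction task is mostly a matter of adding another small output layer and its loss.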
And it has lots of obvious
applications, some of which
I'm not allowed to
talk about, so I
decided to just cut that
whole part of the talk.
But that's just a
little snippet of what
we've been doing in this space
that I call dense prediction.
OK, so let me move on.
So let's move on from
dense to sparse output.
So where there's a small number
of things you're trying to get,
but you don't know how many.
And you want to find out
how many things are there
and tell me some properties.
So we're going to
represent things initially
by boxes, bounding boxes.
So this is called
object detection.
We want to find some
categories you care about.
We want to localize them
and put a box around them.
And Tommy Poggio was doing pioneering work on this [AUDIO OUT] with SVMs and sliding windows, and now it's conv nets, but it's not that different.
So there's tons
of applications--
there are some eye
candy from applications
of other teams at Google.
This is the-- I'm not sure
where Tommy's sitting,
but I mentioned Google.
We were talking about the
Google X robotics team.
Oh, there you are.
So [AUDIO OUT] from Paul
[INAUDIBLE] at Google X
robotics.
And so the algorithms I'm
going to describe to you
are actually used by
all of these groups.
So that's cool.
It's very satisfying
to see it being used.
So one-- basically,
we got into this game,
like, 18 months ago, when
the whole company [AUDIO OUT]
to TensorFlow.
We thought, oh, it's time
to not just re-implement
old algorithms, but let's
actually update the tool
chain to use new algorithms.
So we looked at the literature,
and said, OK, there's
these ones that have been
winning these competitions.
They all have
different acronyms.
There's SSD and Faster R-CNN, blah, blah, blah,
but they're all very similar.
And it's like
convergent evolution.
So basically, you have
some convolutional block
that's extracting features
densely across the image.
And now, essentially, it's
just like a sliding window
classifier, right?
For every patch,
you're predicting
what the label of
that patch is and what
the coordinates of the box are.
And of course, the patch could
be background, in which case
you don't predict the
box, so you're only
trying to predict
its location if it's
a non-background category.
And the difference from the
early sliding window methods
is, A, we're using neural
nets instead of SVMs.
But more interestingly,
we don't have
to cover space quite as
densely because the network can
be-- you can sort of
classify each patch coarsely.
And if you think
it's a hit, then you
can learn a regression
offset which
will fine tune the location.
So that will give you
sort of sub-sliding window
accuracy, which-- so that turns
out to be quite efficient.
So it's like a mixture model-- you're sort of tiling space with a finite set.
And then you learn regression
within that finite set.
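To illustrate the tiling-plus-regression idea, here is a small NumPy sketch of the standard anchor-box offset decoding used by SSD/Faster R-CNN-style detectors; the exact parameterization in the production system may differ.

    import numpy as np

    def decode_box(anchor, offsets):
        """anchor: (cx, cy, w, h) from the fixed tiling; offsets: (tx, ty, tw, th) predicted."""
        acx, acy, aw, ah = anchor
        tx, ty, tw, th = offsets
        cx = acx + tx * aw        # shift the centre by a fraction of the anchor size
        cy = acy + ty * ah
        w = aw * np.exp(tw)       # scale width and height multiplicatively
        h = ah * np.exp(th)
        return cx, cy, w, h

    # One anchor tiling a cell of the feature map, refined by a small predicted offset.
    print(decode_box(anchor=(64.0, 64.0, 32.0, 32.0), offsets=(0.1, -0.2, 0.05, 0.0)))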
So anyway, the SSD,
what it does is
it predicts the box
location and the label
in one shot, single shot.
But then an alternative would be
this method called faster R-CNN
where they first predict
the box location,
and they don't know what it is.
They just know it's something.
It's like a generic
box proposal.
And then they extract features
from inside of that box.
And then they try to
figure out what it is.
And the nice thing about this
approach-- it's more accurate,
but you can have any
kind of output here.
It could be predicting box
coordinates or other signals.
And I'll give some
examples later.
So anyway-- and then there's R-FCN.
It's another paper from Ross
Girshick and colleagues,
which is a variation where
you compute your features.
The second-- this output
head is sharing features,
so the final layer
is very efficient.
It's just sharing more
features as an efficiency,
speed up, primarily.
So anyway, Jonathan [AUDIO OUT], the tech
lead on my team in charge
of object detection, he
and some colleagues devised this
nice API that sort of captures
all of these models and more.
And you know, we implemented
all of this in TensorFlow.
And it enabled us to sort
of try all these methods
and see how they compare.
And they all have
various knobs, which
I'm not going to get into,
that let you sort of trade off
speed and accuracy
as this is just--
this is a very quick model.
It's using something
called MobileNet,
developed by some
of our colleagues
on the Mobile Vision team.
I forget exactly the
running speed, but you know,
it's a few hundred
milliseconds per frame--
very lightweight.
And then this is
some heavy thing
that is picking up
on small [AUDIO OUT],
like the kite that
was missed here,
and gets rid of some
of the false positives.
So it's clearly
getting more people.
And this is a
kite, not a person.
There's nobody-- it's
not someone windsurfing.
So you are getting
gains, but you
are paying a price for that.
So now we have this
sort of [AUDIO OUT],
this toolbox we're able to--
or this factory, really.
We're able to sort of mint
tools from the factory that
span this spectrum.
So we have a CVPR
paper this year
where we exhaustively
sort of spanned
the space of these models
and tried wiggling knobs
to make this trade off
curve between speed, which
is on the horizontal axis, and
accuracy, on the vertical axis.
And we wanted to find,
OK, who are on the--
what's on the frontier, right?
So we have these
critical points,
which are the models
that strictly dominate
the ones below it, at least
empirically, on the data
sets that we tried.
And [AUDIO OUT] sort
of say, OK, if you
don't have a lot
of compute, maybe
this is the model
you should use.
If you don't care about speed,
this is the one you should use.
And if you're in--
this is sort of the sweet spot.
And so this is a
pretty useful paper
for people who
work in this field.
And then we said,
OK, well we have
[AUDIO OUT] different
models, and we can easily
ensemble them together and
that gives you a nice big win.
And using this
model ensemble, we
won the COCO detection
challenge last year.
And we actually won it by
a pretty healthy margin.
So our final score,
I guess, was 40--
whatever that is--
41 or 42.
I can't remember anymore.
And [AUDIO OUT] a
fairly large gap
relative to the
second best, which
is the team from Microsoft,
and then various other teams.
So we're pretty happy with that.
So that was a very nice
outcome-- a lot of work,
of course.
So that's great.
But this is-- we're just
getting started, right?
Bounding boxes [AUDIO OUT]
a crude approximation
to the shape of objects, so
we'd like to actually get
a more fine-grained outline.
So I mentioned already
semantic segmentation,
where we-- where, if you say the
categories are table and chair,
that would group
these chairs together.
It doesn't know that
they're individual chairs.
But if I want to
count chairs, then I
need to say not only
that it's a chair,
but it's chair one versus chair
two versus chair three, right?
So that's the difference.
So you can think of that as--
well, one way to tackle that,
the way that we tackled it
when Peng Wang was
my intern last year--
we just say, well, we
already have this juicy stack
that predicts boxes.
Let's just-- instead of just
predicting corners of the box,
let's predict [AUDIO OUT]
segmentation mask
inside of that box.
So we can reuse the
segmentation machinery
that I talked about earlier.
But instead of applying
it to the whole image,
just apply it within the patch. So it's just a pipeline approach, a two-stage pipeline.
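A rough sketch of that two-stage pipeline, with detector and mask_head as stand-ins for the detection and segmentation models described above; this is an illustration, not the actual submission code.

    import numpy as np

    def segment_instances(image, detector, mask_head, score_thresh=0.5):
        instances = []
        for box, label, score in detector(image):      # box = (y0, x0, y1, x1) in pixels
            if score < score_thresh:
                continue
            y0, x0, y1, x1 = [int(v) for v in box]
            patch = image[y0:y1, x0:x1]                # only look inside the proposed box
            mask = mask_head(patch) > 0.5              # binary foreground mask for this instance
            instances.append({"label": label, "box": box, "mask": mask})
        return instances

    # Toy stand-ins so the sketch runs end to end.
    dummy_detector = lambda img: [((10, 10, 50, 50), "person", 0.9)]
    dummy_mask_head = lambda patch: np.random.rand(*patch.shape[:2])
    print(segment_instances(np.zeros((100, 100, 3)), dummy_detector, dummy_mask_head)[0]["label"])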
And this actually
got second place
in the COCO instance segmentation challenge last year. So we're pretty happy with that-- we didn't publish it, though, because it didn't win.
And methodologically,
it's not that novel.
There's various other
methods that are similar.
But it works really well.
You get really nice,
pretty pictures.
It's like-- even
in occluded cases,
like these children are
occluding each other,
and it can segment them out.
And in some cases,
it finds objects
that are really hard to see,
even for people, sometimes.
So it's great.
More recently, very
recently, there's
an alternative, slightly
different approach
from the team at
Facebook AI research,
and they call it Mask R-CNN.
And basically, instead
of first predicting a box
and then predicting the
mask inside of the box,
they predict them in parallel.
Other than that, it's the same.
We have this nice, sort
of generic set of tools,
so we were able to re-implement
this in a couple of days,
basically, because these
are just, essentially,
changing the wiring
diagram of your network
and letting it train.
And you have to change
the loss function,
is the other big thing.
But Alireza Fathi, who's on
my team, [AUDIO OUT] him.
He coded it up.
These are some
preliminary results.
It hasn't fully
trained or anything.
But you can already see,
roughly, what it's doing.
And you know, it's pretty cool.
So we get boxes, but
we're going to get
the masks within the boxes.
And it works even in
quite challenging cases,
like when there's
overlap, and so on.
So this is already useful to
product teams of various kinds.
Now, that's object.
So we've talked about
sort of-- stuff.
And we've talked about things.
And now, people are,
in some senses, things.
But they're, obviously,
a special case.
So people detection--
we can do, literally,
reusing our [AUDIO OUT] stack.
We just change the data.
But again, we don't
want just boxes.
We could get the mask of
the person, and we do.
But we also want to get the
articulation of the body.
And so far, we're
only doing it in 2D.
There's various groups working
on 3D pose estimation, which I
think is certainly more useful.
Chris Bregler joined Google
recently [AUDIO OUT],
and we collaborate with him.
And in fact, one of
Chris's teammates
worked with us on the pose
estimation challenge in COCO.
So COCO had three
challenges last year.
We entered all of them.
So we won the object detection.
We were second in segmentation.
And we were second
in the key point one,
although there was a bug in our
code, and after the deadline,
we fixed the bug.
And then we
[AUDIO OUT] number one,
but it was too late,
so the history books
record us as being second.
That's OK.
So the approach is, again--
it's actually pretty simple.
There's a few twists that make
it publishable enough at CVPR.
And I'll tell you,
roughly, what they are.
But it's a two-stage pipeline.
We detect the person in a box.
And now we're going
to [AUDIO OUT] points
inside of that box.
And the key points are
represented as a heat map.
So it's a bit like a
segmentation thing.
But instead of simply predicting
the location on a coarse grid,
we can learn offsets, just
like we do with bounding boxes,
like I mentioned earlier.
And so you're going to
get a [AUDIO OUT] vector.
And you can think of it as like
a mixture of Gaussians model.
So you get a
weighted combination,
and you add this weighted
vector field together,
just like you would
in Hough voting.
And you get a much more
precise localization
of the key points than you
would with prior methods,
which is why we're
able to beat them.
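As a simplified illustration of the heatmap-plus-offsets idea, the sketch below has every pixel vote for the keypoint location (its own coordinates plus a predicted offset), weighted by the heatmap probability; the real method aggregates votes into a score map rather than taking a weighted mean, but the flavor is the same.

    import numpy as np

    def localize_keypoint(heatmap, offsets):
        """heatmap: (H, W) probabilities; offsets: (H, W, 2) predicted (dy, dx) per pixel."""
        h, w = heatmap.shape
        ys, xs = np.mgrid[0:h, 0:w]
        votes_y = ys + offsets[..., 0]     # each pixel's vote for the keypoint row
        votes_x = xs + offsets[..., 1]     # and for the column
        weight = heatmap / heatmap.sum()
        return (weight * votes_y).sum(), (weight * votes_x).sum()

    # Toy example: a blob of probability near (20, 30) with small random offsets.
    hm = np.zeros((64, 64)); hm[18:23, 28:33] = 1.0
    off = np.random.randn(64, 64, 2) * 0.1
    print(localize_keypoint(hm, off))      # roughly (20, 30)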
So we're currently number
one on this leader board.
Everyone sort of
leapfrogs each other.
So the last time I
checked, we're number one.
And these are the numbers
that George sent me.
So last time he made this
slide, we were number one.
This is the average
precision metric.
And here we are--
0.649.
We beat the Facebook
paper that just came out.
And they claim they were
number one, so by transitivity,
we're surely number one.
And there's a group from CMU
that won the competition.
And so all of these methods
are similar at a high level,
but they differ in
some of the details.
And these details seemed to
matter for these problems.
So we're pretty happy with
that, and as you can imagine,
there's a lot of
applications for this stuff.
And [AUDIO OUT] work in progress
that George is pushing on
is to say, well, especially
in the case of people
and kinematic chains,
there's a lot of structure
that we can exploit.
We know it ahead of time.
Let's not completely throw
the baby out with the bathwater.
Maybe, let's revisit the CRFs
and deformable parts models
and try to leverage that
in conjunction with these--
the juicy [AUDIO OUT] signals
that we get with our conv nets.
And for simple problems,
when it's isolated people,
you just don't get any wins.
But when there's a lot
of occlusion and overlap,
and perhaps in
tracking scenarios,
we expect there to be
more significant gains.
But that's work in
progress, so stay tuned.
I-- pretty good,
actually, timing-wise.
So I aggressively
cut a lot of slides
because I didn't want
to go deep dive on that.
I'd rather go into more depth
in the following sections.
But before-- so that's sort
of interpreting the world
through, say, a single image.
We build up some state
estimator, in a sense.
We're going to do
something with that, right?
So we might want to
describe it to a person,
or the person might want to
interrogate us and ask us
questions, right?
So language is going to be
a medium in both directions
when we bring humans
into the picture.
So that's one, [AUDIO OUT]
sort of the main motivation
for studying it.
The other is also from a
machine-learning point of view
and a research point of view.
Everything I've
said up until now
is supervised
learning with CNNs.
And so you know,
the models, they
differ in, maybe,
the loss function
or exactly what
topology they use.
And there's a [AUDIO OUT]
in that space.
But at some level
of abstraction,
they're all quite similar.
There's a lot of
other models that
are interesting and
worth exploring that have
more expressive power.
So you know, RNNs, recurrent networks, being Turing complete, are clearly more powerful than a stateless feedforward model, right?
So there are settings
where we want
to model stateful computational processes.
And then I'll talk a little bit
about variational autoencoders as density models that are unsupervised.
And so certainly, if I switch to my machine learning textbook view-- this is just p of y given x, where x is the image and y is some kind of annotation--
body pose, label, box,
whatever.
This is the same thing, except
you've got lots of them, 1
to t, a variable number.
That's the important thing-- that the length is variable.
Otherwise it's just a
fixed length vector, right?
And then this is the
same, except it's
a joint model of x and y.
And that lets you do
cool things because you
can have missing data, or you
just have images and no text
or just text and no images.
And you hope that
the latent variables
capture the correlation.
And I'm going to talk
about that at the end.
So there's many more
models that we can explore
and that we need to explore when
we get into language modeling
because the problem is more
difficult. You need to take--
well, in the general case,
sort of, it's AI complete.
You need human intentionality
and a very deep understanding
of the world to do a good job.
But even to do an
OK job, I think,
you can't just brute force
it by collecting label data.
So we do not only
need fancy models,
but we need to move
beyond just max likelihood
training with input-output
pairs and look
at other ways of training these
models with different objective functions.
So I'll give an example where
we use reinforcement learning
to train modules to optimize a
criterion, or reward function,
if you like, that's better
suited to the task that
is not likelihood of the data.
And then when we're doing
density modeling or latent
[AUDIO OUT]
modeling, we're going
to use variational Bayes,
which is, basically,
converting Bayesian inference
into an optimization problem
because we like optimization.
We have good software for it.
But we will get uncertainty
and all of that good stuff
coming out of it.
So this is a richer playground.
It's more fun, if
you're a machine
learning researcher, than just
living in this top left corner
where we were
previously playing.
So that's sort of a
methodological point.
So I'll just have a quick break.
So I'll first talk about
mapping from images to text.
So let's go back to this
example I started with.
There's two obvious
approaches, right?
We could take the image,
and we could parse it
into all of the pieces.
And I already explained
how we could do that.
And then we could
use those pieces
and convert it into
a sentence somehow,
and you could imagine
using template methods.
And in fact, people
do use that approach,
and it can work [AUDIO OUT].
And it could be
template-based, or this
could be a neural net that
has access to these signals.
Or it could just be
an end-to-end thing,
and that-- in some senses,
that seems to work better.
But we will-- and
I'll show some results
which look really cool that
take this end-to-end approach.
But there are some
caveats there,
which we will visit later.
So I'm, for the most
part, actually going
to follow the trend and do
this direct end-to-end stuff
and not really use the scene
interpretation that we've
built up.
We have those signals.
We should use them in
this part of the work.
We haven't.
It ought to help, but we
haven't really tried that hard.
So almost surely
everyone in the room
is aware of the breakthrough
that happened in 2015.
Simultaneously, several
groups at Berkeley and Toronto
and Microsoft, and so on, all
kind of stumbled [AUDIO OUT]
the same idea of treating image
captioning as a translation
problem, where you're
translating from images in one
language to captions in another
by-- you take your image,
you pass it through the CNNs
that I've been talking about
to create a vector
representation that squeezes
some semantic juice out of it.
And then [AUDIO OUT]
as conditioning
for a recurrent
neural network that's
trained to generate
sentences one word at a time.
And the features derived
from the image bias
the word choices that you make.
And the thing is trained on supervised image-caption pairs to maximize conditional likelihood.
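A minimal tf.keras sketch of that framework (not any group's actual model): CNN features condition an LSTM decoder that predicts the next word at each step. The vocabulary size, dimensions, and the MobileNetV2 backbone are placeholder choices.

    import tensorflow as tf

    VOCAB, EMBED, HIDDEN = 10000, 256, 512   # placeholder sizes

    image = tf.keras.Input(shape=(224, 224, 3))
    cnn = tf.keras.applications.MobileNetV2(include_top=False, weights=None, pooling="avg")
    feat = cnn(image)                                              # one vector per image
    h0 = tf.keras.layers.Dense(HIDDEN, activation="relu")(feat)    # image features become the
    c0 = tf.keras.layers.Dense(HIDDEN, activation="relu")(feat)    # decoder's initial state

    tokens = tf.keras.Input(shape=(None,), dtype="int32")          # the caption prefix, as word ids
    emb = tf.keras.layers.Embedding(VOCAB, EMBED)(tokens)
    rnn_out = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(emb, initial_state=[h0, c0])
    word_logits = tf.keras.layers.Dense(VOCAB)(rnn_out)            # next-word scores at each step

    captioner = tf.keras.Model([image, tokens], word_logits)
    captioner.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))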
So all of these methods
adopt that kind of framework.
These-- this is Oriol
Vinyals, et al--
he's a colleague of mine
at DeepMind, Fang, et al,
from Microsoft,
where they did use
some of this intermediate
structure of object detection,
and so on.
And they get broadly
similar results.
So you know, several
groups came across it,
and they're all getting
[AUDIO OUT] results like this.
You've probably seen
this stuff, right?
You give it an input like this.
It was annotated by a human--
three different types of
pizza on top of a stove.
And the model says--
two pizzas sitting on
top of a stove top oven.
You know, is it two?
Is it three?
It's not really clear.
You could quibble over that.
It's a little bit agrammatical--
or not agrammatical,
but disfluent-- sitting
on top of a stove top.
But you know, if you're,
maybe, not a native speaker,
it's not bad.
And you know, it's
pretty, pretty amazing,
especially compared to
where we were before.
And I could give you
more amazing results,
but let's look at the
not-quite-so-amazing results
because it's more
fun, more interesting,
if we want [AUDIO OUT]
improve things.
So you know, you get-- this
is a common failure mode.
These models don't really count,
and they often just guess two
because two is the most
frequent noun phrase--
[AUDIENCE LAUGHTER]
--in the data set.
These models have
millions of parameters.
And any bias in the data, they will find and exploit ruthlessly.
And then, [AUDIO OUT] there are
some really embarrassing fails.
[AUDIENCE LAUGHTER]
You know, the data
set, the COCO data set
doesn't have these
kinds of images.
And you know, we would like
this stuff to actually work.
I mean, we would like
to be able to annotate
all the world's images for
visually impaired users.
But if it's going to
do things like this--
and furthermore, if the
model confidently believes
this is the right answer--
we can't ship this.
This would be
really embarrassing.
So this has not launched.
And the accessibility team--
when they saw these--
I mean, this stuff was so cool
it made the New York Times.
And you know, they said,
hey, this is great.
We want to-- we'll
give you engineers.
We can code it.
We'll take your Python
prototype and redo a few.
It's not a problem.
But then they run it on
their data, and it fails.
And we say, OK, well, we haven't
quite got our act together yet.
So what are we going
to do to improve?
So there are some
basic problems here.
Perhaps the most fundamental
problem [AUDIO OUT]
we got from Jitendra Malik.
But I think the
fundamental problem
is that it's really
hard to evaluate
whether the caption's any good.
And so if we can't measure
progress in a rigorous way,
we can't make progress.
So we decided to take
one angle on this,
which is to look at a
special case of captioning
where there's really a task.
And this task is called
referring expressions.
This is a standard setup.
In computational linguistics,
it's been around for a while.
And the idea is that
you have two people.
So we're bringing multiple
agents into the picture
because communication requires
at least two agents, right?
Otherwise there's not
really any point to it.
So we're going to have a
speaker and a listener.
And the speaker sees
an image and wants
to convey some information
to the listener.
And what he wants to convey
is some-- the location
of some object of interest.
So what he's going to do is--
in our setup, there's a
speaker that's given an image,
and we're told, please
describe [AUDIO OUT],
so this box, which is
guaranteed to correspond
to an object of interest
coming from the same COCO data
set that I mentioned.
And then the algorithm,
the speaker algorithm,
has to generate a sentence
such that when that sentence is
received by the listener,
the listener can decode it
and correctly infer which
object was being referred to.
And if they correctly
decode the objects,
then they get points based on
understanding it correctly.
So you're rewarded both
for speaking clearly
and for comprehending correctly.
And we're going to train
these two agents to cooperate--
it's really-- it's
a cooperative game,
so it's really just
one meta system.
[AUDIO OUT] game theory
or anything like that.
And we're going to
simplify the problem
by giving the listener
a finite set of things
it has to choose from.
But the nice thing
here is it's very
easy to measure performance
because you can just
say there's five boxes.
Did you-- what
fraction of the time
did you get the correct box?
Or if it's a regression setting,
you can measure [AUDIO OUT].
And it's a meaningful
thing, right?
You didn't quite get
the location right,
but you were close.
So we can make progress, and you
can potentially train on this.
AUDIENCE: [INAUDIBLE].
Training
KEVIN MURPHY: Yes.
That's a very good question.
So we actually have
human label data.
Do I have a slide on that?
I'm not sure if I do.
We got people to create--
so we can use supervised learning.
So we had people annotate it.
And we ground it.
We're not actually
having the agents
create their own language.
There has been a lot of recent--
well, several [AUDIO OUT]
papers where they use
RL, where the agents just
sort of start babbling, and
they create their own language.
And if you seed it with human--
a human language,
typically English,
it might stick to using that,
but it's not guaranteed to.
Here we're actually using
supervised learning, not
reinforcement
learning, and it's only
training on data [AUDIO OUT]
created by people.
And I'm going to
come back to that.
It's clearly a bottleneck.
I'm trying to get away from
the drug of supervised data.
But here, this is
the first time we
were working on
this problem, so we
wanted to start simple,
or relatively simple.
So I should mention this is--
the first author of this work
was [INAUDIBLE],, who
was, at the time,
a PhD student at UCLA
[AUDIO OUT] and has since
joined the Google Waymo,
or Alphabet Waymo,
whatever it is-- the
self-driving car team.
So this was published
in CVPR last year.
So how are we going
to tackle this?
So the baseline
approach would be--
you're given a
region of the image,
and now what you
could imagine doing
is just [AUDIO OUT]
features from that region.
And then just use the
pipeline that we already have,
which takes image
features and generates
a sentence using an RNN that's
conditioned on the image
features.
And then we can give it some
context from the whole image,
right?
So that's the baseline model.
It's a maximum
likelihood model that's
predicting a sequence of words,
given the region features.
And then, what would
the listener do?
The listener is given a
finite set of regions.
It could just rank
how likely each region
is to match the
sentence that was spoken
and compute the most likely match.
So there's a max
likelihood classifier
very similar to what people
did in speech recognition
five or 10 years
ago, [AUDIO OUT]
would be, say, an HMM per
phoneme or something like that.
But there's obviously a
problem with many aspects of--
well, in particular,
the speaker.
If you have a setup like
this, and I say, OK, please
describe this region of the
image to this, to the listener,
it might just say, the girl.
There's no reason why it
should say, the girl in pink
because it--
we are giving it
the whole context,
but it doesn't really know that
there's some ambiguity here,
and so it should add extra
redundancy to its description
to make it unambiguous.
So what we can do is realize
that the purpose is not just
to describe this patch, but to
[AUDIO OUT] convey information
when there's some ambiguity so
that the listener can decode
correctly.
So there's a nice game
theoretic analysis of this
that Percy Liang and
colleagues came up with.
I think there's too many
symbols for me to decode here.
But the bottom line
is that the speaker
should take into account the
belief state of the listener,
essentially.
And when they're creating
a sequence of words, w,
for a region, they
should make sure
that the likelihood
of that description
is higher for the true
region than for any
of the other regions
because if that's true,
then the max likelihood decoder
is going to work, right?
You know that will be
ranked correct [AUDIO OUT].
So if you can satisfy
this criterion,
then it will ensure that the
listener will decode correctly,
and you'll both be happy.
So what you can do then,
is then, instead of--
let's see if I have it.
So it's a very
simple change, right?
Instead of maximizing the
likelihood of the words,
given the regions, we can
maximize the [AUDIO OUT]
probability of the true
region, given the words, right?
And so in speech
recognition, this
is called MMI training,
maximum mutual information.
This is a blast from
the past if there's
any speech people in the room.
But it's just-- you
compute the posterior,
and then you're going
to maximize that.
So it's discriminative training
because of this normalization
constant because you'll
take into account
the relative likelihoods
of each of the regions--
and making sure the true one
is higher than the others.
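In code, the change is small. Here is a NumPy sketch of the idea, assuming a uniform prior over regions and a hypothetical per-region sentence log-likelihood coming from the speaker model.

    import numpy as np

    def mmi_loss(log_p_words_given_region, true_region):
        """log_p_words_given_region: (R,) log-likelihood of the sentence under each candidate region."""
        log_posterior = log_p_words_given_region - np.logaddexp.reduce(log_p_words_given_region)
        return -log_posterior[true_region]   # minimise the negative log posterior of the true region

    # Toy example with 5 candidate regions; training pushes the true region (index 2) to win.
    print(mmi_loss(np.array([-12.0, -10.5, -9.0, -11.2, -13.4]), true_region=2))

The normalization over all candidate regions is exactly what makes the training discriminative.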
And we also tried
a ranking loss,
and it's more or less the same.
I like this better because
it's probabilistic,
but they're very similar.
So we just changed
the loss function.
And pretty much everything
else is the same.
And lo and behold, it helps.
So this is the data collected.
I'll skip this.
So you just train that up,
max likelihood training.
And I'm skipping the
architectural details
that are in the paper.
And let's-- just a little demo.
So here's an image.
If we point to this guy
on the left, and we say,
please describe it, it will say,
a man wearing a black jacket.
And we point [AUDIO OUT]
right, and it
says, a woman in a black dress.
It's pretty good.
Let's do one more--
red bus, double decker bus.
It's pretty good.
Doesn't always work--
let's look at--
this is an interesting
failure case.
We point to this thing.
And it says, a bus in the right.
Well, it looks like
a train, not a bus.
Is it on the right?
Or is on the left?
That's the point of view of
the speaker and listener.
It doesn't capture
that kind of subtlety.
It's not exactly
grammatical-- "in the right."
So you know, there are flaws.
But you know, it's
pretty good, actually.
It's surprisingly good,
given how simple it is.
And on the listener's side,
we can [AUDIO OUT] it's
fairly adaptive.
So here's an image.
Like I said, we're going to give
it a candidate set of regions.
And these actually come
from the object detector
that I mentioned earlier,
but the class agnostic object
detector, the region
proposal network,
also called MultiBox, that just says, here are five or 10 object-like
things I found, [AUDIO OUT]
about one of these.
And then, depending
on what I say,
I'm going to highlight
one of those candidates.
So if I say, a black carry-on
suitcase with wheels,
it will pick this one.
If I say, a black suitcase,
it picks the same thing,
a red suitcase, it
flips over here,
a truck in the
background, it picks this.
So you know, it's responding.
And [INAUDIBLE] made a--
I don't have it [AUDIO OUT].
He made a nice little
JavaScript demo.
And you can try it
and poke holes in it.
And you know, it's pretty cool.
Anyone who's worked
on the COCO data set
knows it's full of giraffes,
so we have mandatory giraffe
examples.
So you know, this is
pretty interesting.
So we have a nice
objective function
to measure [AUDIO OUT].
And you know, we're capturing
some aspects of communication,
which is a multi-party setup.
But we were relying
on manually labeled
data to specify
discriminative descriptions
for each of these
regions, take into account
what the confusing
categories were.
And this isn't true-- isn't
possible, in general, right?
So imagine we
change the scenario.
So instead of having
a single image
and there's one
region out of five
that I'm trying to
discriminate between,
I have a set of images.
And there's one
member of the set
that I'm trying to
describe to you.
And I want to describe this
set-- this instance [AUDIO OUT]
all of the others.
And this set could be arbitrary.
It changes at runtime.
It could be quite big.
Here, I'm going to focus on
a single distractor image.
So what I'd like to do
is describe this image
on the left such that
you won't confuse it
with the one on the right.
So a default model,
which is just
like a max likelihood
[AUDIO OUT] on caption data
does, in fact, say it-- like, an
airplane is flying in the sky,
which is a reasonable
description if your task is
to describe this image.
But if your task is
to distinguish it
from other members of the set,
where-- like, in this case,
the set has two elements, then
you're going to get confused.
So what you should
do is say, well,
if my goal is to be
discriminative or distinctive,
then I should maybe generate
this-- a large passenger
jet flying through a blue sky.
So this is not a passenger jet.
It's clear, if you hear
that, you're referring
to this one on the left.
So this is very similar
to the setup I had before,
except the key
difference is we're not
going [AUDIO OUT] collect
training data that
is explicitly discriminative.
We're just going to reuse the
caption data we already have,
and we're going to change
the way the model works.
So we're going to dynamically
derive discriminative
functionality from a
model that was trained,
in sort of a generative way.
It's a pretty simple idea.
So we're just going to, again--
same principle.
We're going to
modify the speaker
to pay attention to the needs
of the listener in a simple way.
So the key idea is
to take into account
that the listener is going to
be computing a likelihood ratio.
If there's only
two choices, it's
going to be doing
this max likelihood
decoding, like I mentioned.
How likely is the sentence
under this hypothesis
versus this hypothesis?
So when we're
considering a sentence
that we might generate,
we want to make
sure it's more likely under
the true image, as opposed
to the distracting image, right?
So we're just going to have
this log likelihood ratio,
and we're going to generate a
sentence that maximizes that.
But that could give rise
to agrammatical sentences,
so we're also going to have a
language model term, which just
says, generate me a
sentence that's likely,
but also one that is more
likely under the correct image,
as opposed to the
distracting image.
So that's our objective
function that we
use at runtime to
decode our sentences
and [AUDIO OUT] rewrite this.
Since it's just a
sequential model,
you can rewrite this as a
sequence of conditional terms.
And then you can
use beam search,
and then you just modify
the beam search algorithm
with a slightly different
decoding function.
And you can decode
from this greedily,
and it's very
simple to implement.
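A tiny sketch of that decode-time score, illustrative only: lam is a made-up trade-off weight, and the per-word log-probabilities would come from the trained captioner.

    import numpy as np

    def introspective_score(log_p_target, log_p_distractor, lam=0.5):
        """Both inputs: log p(next word | prefix, image) for a candidate word."""
        fluency = log_p_target                        # keep the sentence likely (language-model-ish term)
        contrast = log_p_target - log_p_distractor    # log likelihood ratio against the distractor
        return (1 - lam) * fluency + lam * contrast

    # During greedy or beam search, rank candidate next words by this score instead of
    # plain log p(word | prefix, target image).
    print(introspective_score(np.log(0.30), np.log(0.05)))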
So again, the first
author of this work
is Rama Vedantam, who was
interning at Google last year.
And you'll see his
name pop up again
in another part of the work.
He's a student at Virginia Tech,
about to move to Georgia Tech.
And the co-author, [AUDIO OUT],
many of you know him.
Devi is Rama's adviser,
and Gal Chechik
is a colleague of
mine at Google.
So let me just show
you some results.
So the generic model
is the captioning model
I mentioned earlier,
which we used-- I think it's the standard Show, Attend and Tell.
It's the CNN [AUDIO OUT] thing.
So if you give it this
green image, it will say,
a man and a woman
playing a video game.
But that's a bit ambiguous.
This introspective speaker
that does the likelihood ratio
decoding says, a man
is sitting on a couch
with a remote,
which is certainly
a better fit to this image
than to that one, right?
Let's do one more example.
The generic model would
say, a train traveling down
tracks next to a forest.
And we've chosen the test
set so that it's ambiguous
by matching-- they contain--
in this case, I
think they either
have the [AUDIO OUT] same
captions, I think, either
according to the model or
according to the humans.
So these images were paired by having the same caption.
I can't remember now if
this is in the training data
or if it's due to the algorithm.
But in any case, on
their own, these sort of
map to the same point in
language space, in some senses.
But as a pair, they're not--
they need to be distinguished.
[AUDIO OUT] would
say, a red train
is on the tracks
in the woods, which
is clearly more-- a better
fit for this left one
than the right one.
So it's doing
pretty well, right?
And it's very simple.
I was quite happy-- very happy when these things worked.
And then ultimately,
we want to know--
does it work with
humans in the loop?
[AUDIO OUT] So we
did an AMT study,
and we looked at
two settings where
the-- let me see if I remember.
The easy confusions
are ones that
are confusing images that are
similar in some feature spaces,
such as FC-7, some layer
of the neural network.
And then the hard ones are not
only similar in feature space,
but they all [AUDIO OUT]
have very similar captions,
according to humans.
And in any case,
in both scenarios,
our introspective method,
IS, is significantly better
than the baseline, which
just did standard max
likelihood decoding.
So this is cool.
You know, I think we're
making some progress.
Both of these
models were trained
using maximum likelihood.
And they were-- the decoding
was using a different objective,
but the training was still ML.
But that's a problem because--
let's see.
What are the problems
of maximum likelihood?
Well, with these
sequential models,
we're decoding one
word at a time.
So I'm predicting-- the
train on the tracks.
Let's see if I can pick a
more interesting example.
It could be-- the train
on the turnstile, right?
So by the time I
get to "on the,"
"tracks" is probably
the most likely word
in my grammar model.
And if I accidentally made an
error there and said something
else, I'm not going to be
able to recover from it
because my language
prior is so strong.
If my previous predictions
are different from the ground truth,
I'm conditioning
on things I haven't
seen in the training set.
So I deviate from
what I trained on,
and I start entering parts
of data space
that the model hasn't
been exposed to.
So this is called the
exposure bias problem.
Because you're--
at training time,
you're always conditioned
on the ground truth prefix.
But at test time, you're
always conditioned
on the predicted prefix.
And they might start to
become arbitrarily different,
and the models can
perform poorly.
So that's a well-known problem
with max likelihood training
in these sequential models.
So what we can do is to
replace maximum likelihood
with some other objective
that looks at, maybe,
the overall fluency
of a sentence or its--
how well it performs at
some discrimination task.
And we can use reinforcement
learning methods
that are able, in principle, to
optimize black box functions.
So in particular, we can use
the policy gradient algorithm
also called REINFORCE, to optimize
anything we want, in principle.
So the MIXER paper from
some Facebook guys--
they published it last year,
and they used this approach.
They optimized the
BLEU score, which
is a metric from the machine
translation community.
And they showed some wins over
just max likelihood training.
But there's a couple
of problems with this.
So the biggest problem
is the BLEU score
is just not very well correlated
with human judgment of caption
quality.
It's a very syntactic thing.
It may be better for
machine translation,
but for image captioning--
if you look at the quality
of algorithms compared
to the quality of people,
according to BLEU,
it seems that the algorithms
are better than people.
So that's clearly not true.
So this is just
clearly a bad metric.
And this is the
case for pretty much
all of the automatic metrics.
The other issue [AUDIO OUT]
is a bit more detailed, but--
and I'm not going to go
into that level of detail.
But their particular
policy gradient algorithm
is extremely high variance.
And it's difficult to use.
So we came up with a better
policy gradient method.
And with our better
method, we were
able to explore alternative
objectives that are more
correlated with human judgment.
So in particular, the
approach that we took-- this
is a bit of a technical
detour for those who
are familiar already with this.
When you're estimating
the cost to go,
so your Q function,
the MIXER approach,
basically, averaged the
reward across all the time
steps and [INAUDIBLE] constant.
And we do something similar
to what they do in AlphaGo.
We do, like, Monte
Carlo rollouts.
So we have a partial sequence
of actions or words said so far.
Then we hallucinate possible
endings of the sentence.
We feed these complete sentences
to our black box evaluation
function that I'll tell
you about in a minute.
We get a score, and then
we average over those.
And we use that
average reward
to tell us how well we're doing.
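As a rough sketch of that estimator (the sampler, the reward function, and the rollout count here are placeholders, not the exact setup from the paper):

```python
def rollout_value(prefix, sample_continuation, reward_fn, num_rollouts=16):
    """Monte Carlo estimate of the reward-to-go (a Q-like value) for a
    partial caption: hallucinate several endings, score each complete
    sentence with the black-box reward, and average."""
    total = 0.0
    for _ in range(num_rollouts):
        sentence = prefix + sample_continuation(prefix)
        total += reward_fn(sentence)
    return total / num_rollouts
```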
And these curves are
showing the various metrics
of BLEU and ROUGE and CIDEr and
METEOR as a function of time.
And we're the blue curve.
And you can see we are much
higher and we are faster.
And the previous method, which
is in green, isn't as good,
and its [INAUDIBLE] was
very hard to tune, very
sensitive to hyperparameters.
So we found that this
particular combination
of policy gradient
and Monte Carlo
rollouts was
easier to work with.
And then that lets us explore
the space of reward functions
to try to find something that
matches human judgment better.
So fortunately-- so this
project got started--
I went to ECCV last year.
I think it was in Amsterdam.
And there was this
presentation called SPICE.
This group from Australia,
Mark Johnson's group,
they came up with a metric that,
for the first time, [AUDIO OUT]
put humans at the top
of the leader board,
where they ought to be.
So this plot is
showing human judgment
on the x-axis and automatic
score on the y-axis.
And these dots
correspond to, I think,
judgments according
to different systems.
And this is judgment--
this is human
judgment [AUDIO OUT].
These methods-- so this is the
BLEU score I mentioned here.
The algorithms are all in blue,
and they're scoring higher
than the red human.
And then the CIDEr metric--
the best algorithm
is, apparently, better
than the best human.
And then the METEOR
metric-- similarly.
So they came up with
the SPICE metric,
and finally, for the first
time, [AUDIO OUT] humans
are not only higher ranked
than the algorithms,
but there's a
better correlation.
So the way it works is that
they parse the caption into--
they actually have
multiple captions,
and they parse them all.
And they build a
scene graph, and they
extract the semantic
content of the sentence.
And then they measure
how well that matches
the ground truth.
And then a generated caption
is parsed similarly,
and they match in graph space,
rather than in grammar space.
So they're really saying,
is the semantic essence
of the sentence
similar, as opposed to
is the sequence
of tokens that are
generated similar.
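In spirit, the matching step is an F-score over scene-graph tuples, something like the toy sketch below (the real SPICE metric parses the captions and handles synonyms; this assumes the tuples are already extracted and uses exact match):

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """Compare the (object, attribute) and (object, relation, object) tuples
    extracted from a generated caption against those from the references."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    precision, recall = matched / len(cand), matched / len(ref)
    return 0.0 if matched == 0 else 2 * precision * recall / (precision + recall)

# e.g. tuple_f1({("train", "red"), ("train", "on", "tracks")},
#               {("train", "red"), ("train", "in", "woods")})  -> 0.5
```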
So this feels like the
right thing to do, right?
So an obvious thing
is-- so, you know,
policy gradient is awesome.
We have this hammer.
We can optimize anything.
They have a way of
measuring performance
that seems to be correlated
with what we want to do.
Let's just optimize the
crap out of the SPICE score.
But it turns out,
if you do that--
I don't think I
have any examples--
because it's only looking at
the semantic structure and not
the syntax, [AUDIO OUT]
you will, of course,
do well by this metric.
But your sentences
aren't very grammatical.
So we did a simple thing.
We simply mixed the
SPICE metric that
captures semantics
with the CIDEr
metric, which captures syntax.
And we call the
combination SPIDEr.
And then we optimize that.
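So the reward being optimized is just a weighted mix, something like this sketch (an equal weighting is the obvious choice, but treat the weight and the function names as illustrative):

```python
def spider_reward(candidate, references, spice_fn, cider_fn, alpha=0.5):
    """Mix a semantics-oriented score (SPICE) with a consensus- and
    fluency-oriented one (CIDEr); this mixed score is what the policy
    gradient method is asked to optimize."""
    return (alpha * spice_fn(candidate, references)
            + (1.0 - alpha) * cider_fn(candidate, references))
```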
And so if you optimize that
using our policy
gradient method, and then you
show those captions to humans,
then humans like them
much more than they like
captions from other methods, which is good.
So we're generating stuff
that people are happier with.
You can play a different
game if you want.
We can take the metrics that are
used in the COCO competition.
So it's a combination of BLEU,
CIDEr, METEOR, and ROUGE.
And we can optimize that.
And then you can--
we used, like,
a really dumb model from two
years ago, the old Show and Tell--
it didn't even have
attention-- a really simple,
you know, VGG-based
baseline model.
But you optimize it by this.
And we were number one for
about a week on the COCO leaderboard.
At least we were
number one at the time
we submitted, so we
could brag about it.
But then we got beaten
by someone else.
But our point was like-- we
were using a very simple model.
We were just optimizing
for the right thing.
But if you show those
captions-- the ones that
win the competition-- to
people, they don't like them.
They don't like them as
much as if you show them
the captions that are
generated by optimizing
this more human metric.
So we're pretty happy with this.
And some of my
colleagues are planning
to try to see if we can get
this working on real data.
But I'll come back to that.
But let me just show you
a couple of examples.
So this is the fire--
there seem to be
a lot of fire trucks in COCO.
So these are the five
captions from humans.
You can read some of them.
This is just the default
baseline max likelihood
training-- a red and white bus
is driving down the street.
Well, it's not really a bus.
The previous method, MIXER--
a yellow bus driving
down a city street.
And then, this is us at the
very bottom-- a red fire
truck is on a city street.
Seems better, right?
Let's just do one more.
This is our method.
The baseline-- a woman
walking down a street
while holding an umbrella.
OK, that's ridiculous.
Our method-- same model,
different loss function--
a group of people walking down
a street with a traffic light.
It's pretty good, right?
So we're quite happy
with this [AUDIO OUT].
Since then, various
other teams have come up
with even better ways of
regularizing or stabilizing
the training process,
which I won't get into.
So let's-- I'll do five more
minutes, and I'll wrap up.
This is the most
recent piece of work.
We just submitted
it to [AUDIO OUT].
We'll put it up on
arXiv shortly.
But let me just motivate it.
You know, it sounds
like everything's great.
And we've made progress.
We made human raters
happier than they were.
So my colleague on
the accessibility team
says, OK, you ready now?
Can we launch?
No, we still can't launch
because-- look at these errors.
It's just ridiculous.
These are errors from our system.
This is from a
couple of years ago.
But we still do make
embarrassing errors.
So a cat is sitting on a toilet
in a bathroom, a woman laying
on a bed with a laptop.
I like this one--
a man in a suit and tie
is holding a cell phone,
a man is riding a
skateboard on a ramp.
I mean, this is ridiculous.
[AUDIO OUT] so
we've made progress.
This-- we're optimizing
a metric that's
closer to human judgment.
But it still feels
like we're sort
of skimming on the surface.
We're picking up on
correlation, and we're
more correlated with humans.
But are we really
understanding what's going on?
I don't think so.
So we wanted to step back a
bit and do some science here
and say, OK, guys,
we're not going to
be able to deliver on your
launch deadline of six months.
We really need to
step back and try
to get a bit more, a richer
understanding of what's
going on.
So let's look at some of the
core problems in language
understanding and vision.
And there are lots
of them, right?
So one of them is just
the variability with which
people describe the
world, or the variability in the
appearance of objects.
We have a good handle on that.
These neural network
models are very
good at learning to be
invariant to lighting and color
and occlusion, to
a certain degree.
But these are sort of
local, statistical noise,
as it were.
There's other kinds of more
radical variation in the world,
like new combinations of things
that I've never seen before,
structurally novel
combinations, compositionality.
You know, this is fundamental,
especially to language, right?
The world is fundamentally
combinatorial,
and so you're never going to
cover it with training data.
So we need to [AUDIO OUT]--
just grab the bull by the horns
and just address this as
a first-class citizen,
which means we have to get away
from just random train test
splits where the
test set is pretty
similar to the training set with
just slightly different colors
on your pixels.
So we're going to do
compositional splits, where
we guarantee that the
thing you're tested on you've
never seen at training time.
And they're going to be
structurally novel test sets.
And then we want to deal
with abstraction, which
is related to compositionality.
So if we have a lot
of signals that vary,
what do they have in common?
What is the essence
of the concept
that you're trying to convey?
And what are the things that
are just random and incidental?
And sure, you might pick
up on that correlation,
but it's not really the core.
And there are, of course,
lots of other problems,
but these are the issues
that we want to tackle.
So we started to think
about language and vision
sort of more deeply, I guess.
So one key thing
is [AUDIO OUT] when
you describe a compositionally
novel sentence to someone,
they may never have heard of it.
It might not even
exist in the world,
but they can still usually
understand it, right?
So if I say "purple hippo,"
it evokes some representation
in your head.
We don't really know-- this
is the thought bubble--
we don't know what that is.
Maybe it's some distributed
representation in a neural net.
Maybe it's-- who knows, right?
I don't want to commit
to what that is.
We want to be agnostic
to what that is.
But then if I probe
you, and I say,
well, do you understand what
I'm talking about, you'll say,
sure.
OK, well, prove it to
me like you would if you
were examining your students.
So a reasonable thing
to do in this scenario
is to ask the student to draw
or to sketch-- anyway, not
photorealistic, but,
like, OK, show me
what you think
I'm talking about.
And they might
generate a diversity
of samples that sort of capture
the essence of this description
or this concept.
And now you could
say, well, I don't
want just any
purple hippo--
I want to say purple
hippo with wings.
So I should be able to be
more specific as I add more
constraints to the problem.
And now, presumably, the
thought bubble, the distribution
of possible worlds has shrunk
because I've added constraints,
and therefore, the set
of samples I generate
should be less diverse.
They should be consistent
with what I say,
[AUDIO OUT] but they should
span the space and not fill
in details arbitrarily
for things I didn't
specify, right?
So just some notation-- we're going
to build up to a model.
We're going to call these
text descriptions, y,
the internal representation
z, and the generated images
will be x.
And of course, we
could do the reverse.
We could have a set of images.
And we could say, OK,
please describe that.
And so this is concept learning.
This is very much inspired
by Josh's thesis, right?
So if I give you these
images, you would-- like,
the least common ancestor, in
some sense, is purple hippo.
These are both purple
hippos, but there's
a more parsimonious explanation,
which is a tighter [AUDIO OUT]
sphere around-- which captures
the data, but only the data.
And that would be purple
hippo with wings, right?
So we would like to capture
that kind of phenomena, as well.
We actually haven't worked on
this particular problem yet.
We believe the model I'm about
to show you can solve it.
We just haven't had time to try.
But [AUDIO OUT]
the model is based
on variational autoencoders,
which some of you
may be familiar with.
So this is just a
latent variable model.
You have some latent variable z.
We're going to assume it
has a Gaussian distribution
because it makes
everything simple,
but it doesn't have to be.
And then we have these
two modalities, right?
So we have images off
on one side.
And we have text
off on the other.
And we're going to
generate everything,
so it's not
discriminative anymore.
And that means we can
train partially supervised.
We can have images on their
own, or labels on their own,
or both.
And we do want some
paired data so we
can learn this correspondence,
but we don't necessarily
require a lot of it.
So the first author of this work
is, again, Rama Vedantam,
who did the discriminative
captioning work.
He liked it so much at
Google, he came back again.
Last time, he was
on another team.
This time he interned with me.
And then my colleague,
Ian, works a lot
on this, and Jonathan Huang,
who I mentioned earlier
in the object detection project.
So this will be coming
up on arXiv [AUDIO OUT]
in the next couple
of days, actually.
So with VAE-- so these joint
models have been around
for a long time, right?
There's nothing new here.
The sort of breakthrough
a few years ago
is to try to make
inference more efficient.
So what you can do is,
you can train a network
to approximate the
inference process.
So what we're going
to do is we're going
to have three inference networks.
So we'll have one
inference network
that infers the posterior
over the latents, given
pairs of data.
And that's, maybe, what
we have at training time.
But at test time, I might
only hear a sentence.
And I want to imagine the
meaning of that sentence,
so I'm going to infer z given y.
So I need a network
that only works
in the text modality [AUDIO OUT].
But I might want
to do the reverse.
I might want to have an image
and embed it into my concept
space so I can describe it.
So I'll need an image-specific
inference network, as well.
So we're going to have
three networks that
capture these different
types of data,
and we need to
jointly train them.
And there's been several papers
on, like, multimodal VAEs.
And they all do
slightly different ways
of training these.
And I'm not going to go
into the relationship
of-- between our
work and theirs.
It's in the paper.
But what we do is we
use just neural nets
to parametrize these
networks in the usual way.
And then there are a couple
of novelties. [AUDIO OUT] We
have this slightly
different objective function
that we call
the triple ELBO,
because there are three ELBOs.
I'm not going to get
into this, but for those
who know the ELBO, this is
the usual ELBO on joint data.
We have an ELBO just
on x's and just on y's.
And this gives us a way to
train these three networks
simultaneously.
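Schematically, the objective is the sum of three ELBOs, one per inference network; this is just a sketch assembled from that description, with weights omitted, not the exact equation from the paper:

```latex
\mathcal{L}_{\text{triple}} =
  \underbrace{\mathbb{E}_{q(z \mid x, y)}\!\left[\log p(x \mid z) + \log p(y \mid z)\right]
    - \mathrm{KL}\!\left(q(z \mid x, y)\,\|\,p(z)\right)}_{\text{ELBO on paired } (x, y)}
+ \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right]
    - \mathrm{KL}\!\left(q(z \mid x)\,\|\,p(z)\right)}_{\text{ELBO on images alone}}
+ \underbrace{\mathbb{E}_{q(z \mid y)}\!\left[\log p(y \mid z)\right]
    - \mathrm{KL}\!\left(q(z \mid y)\,\|\,p(z)\right)}_{\text{ELBO on labels alone}}
```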
And then we wanted to test this.
So we just threw SGD at it.
And we wanted to probe how
well it's doing in this sort
of controlled setting
to see if we could--
if we're tackling
these basic issues.
So the first thing
we did was to take
MNIST, as everyone
does who works on VAEs,
and we replaced the class
labels with an abstraction
of the class label.
So we just gave it two
bits, either the parity--
it's either odd or even--
or the size.
Is it a big number,
bigger than 5?
Or is it less than 5?
So we're not doing
natural language.
We're just doing attributes.
And in this case, they're
two binary attributes.
It's very, very simple.
So you can fit a model to this.
So you've got a bunch
of images, and you've
got these little bit vectors.
If you fit an [AUDIO OUT]
model, which only has images,
and you fit it in 2D, and
you look at the latent space
that it induces, there are
these four categories, right?
There's small and even, and
small and odd, large and even,
and large and odd.
It doesn't devote any
space, in latent space,
for this large even category,
the red guys [AUDIO OUT].
Even though there are
digits in that group,
there's no reason why it should
allocate mental space to it
because it doesn't
know these labels.
And it can recreate the
likelihood of the pixels fine
without it.
And you can monkey
with the loss function.
The beta-VAE paper
is from DeepMind.
Basically, they change the
weighting term on the KL.
And there is the InfoGAN paper
from OpenAI, where they, again,
sort of-- it's a GAN, not a VAE.
But it's pretty similar.
They sort of change
the weight terms,
but they're still just
dealing with images.
And it's nice to try to
squeeze as much juice
as you can from images alone.
But there are going to be some
high-level concepts where
you're going to need some
linguistic or some kind
of structural side information
to tell the system what
it is you care about.
So if you have the joint
model, and you fit it naively,
it starts to do better.
It has to generate the
labels as well as the image,
so it's going to devote
some capacity of its model
to that task.
But there are more bits in
the pixels than in the label,
so it emphasizes
the pixels more.
You can just weight
the labels more highly.
And if you scale
them appropriately,
you get a nice decomposition
of your latent space,
and you could, obviously,
do well at classification
if you wanted to.
So that's cool.
But more interesting
is that we can
do posterior inference.
So like I said, our
latent space is Gaussians,
so our inference
network Q is going
to predict the
parameters of a Gaussian.
And so if I twiddle the bits,
I'm, like, giving it sentences,
and it will map to
the appropriate part
in latent space.
But what we want
to be able to do
is just to describe the world
at different levels of abstraction,
right?
So let's see what
the next figure is.
So I should be able to specify
the concept "all even numbers."
And I don't care if
they're small or big.
I want all the evens.
Or I want all the smalls.
Or I just want all the numbers.
And then I should be
able to generate samples
from my model that are
consistent with what
I did say and maximally
entropic over what
I didn't say.
So how are we going to do that?
Well, we need an
inference network
that can handle missing data.
So what we decided
to do is to use
a product of experts
on the assumption
that these attributes
are roughly orthogonal,
at least in this setting.
So we can-- each expert is
its own Gaussian distribution
that's mapping that particular
attribute to latent space.
And we're going to combine them
multiplicatively so that they--
when they agree, they're going
to carve out a part of space.
So the individual experts, like
this tall one on the right,
is capturing the
concept of small.
And this tall one on the left
is capturing the concept of big.
And this is the even expert,
and that's the odd expert.
And then, if you want to
capture small and even,
then the two experts
fire together.
And they-- these Gaussian
bubbles intersect.
And the nice thing is, in general,
products of experts
are intractable, but in Gaussian
land, everything's analytic.
And it's straightforward
to compute.
So you can fit this.
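The Gaussian product itself is just precision addition; here is a one-dimensional sketch (in the model the latents are multivariate and the means and variances come from the inference networks, so treat this as illustrative):

```python
def product_of_gaussian_experts(means, variances, prior_mean=0.0, prior_var=1.0):
    """Multiply a prior N(prior_mean, prior_var) with one Gaussian expert per
    observed attribute.  The product of Gaussians is Gaussian: precisions add,
    and the mean is the precision-weighted average."""
    precisions = [1.0 / prior_var] + [1.0 / v for v in variances]
    weighted = [prior_mean / prior_var] + [m / v for m, v in zip(means, variances)]
    post_var = 1.0 / sum(precisions)
    post_mean = post_var * sum(weighted)
    return post_mean, post_var

# e.g. combining a "small" expert and an "even" expert:
# product_of_gaussian_experts(means=[-1.0, 0.8], variances=[0.3, 0.4])
```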
One thing to notice is that
these individual experts
are a bit weird.
They have [AUDIO OUT].
What's the time?
I'm really running over.
Yeah, it was a bit
too much detail.
But they have these wide
tails because they're normally
always present.
In Geoff Hinton's original work
all the experts were firing.
In our work, we have a variable
number of experts firing,
depending on what you observe.
So we can just have what we
call the universal expert, which
is the prior.
And that regularizes it because
it's always being multiplied.
And then you get
this beautiful sort
of Olympic-rings-type
structure where
you've got this broad concept.
This is the prior.
You've got these
specific things capturing
aspects of the problem--
parity or magnitude.
And then we can
make compositions
of these individual
components just
by combining these [AUDIO OUT]
experts dynamically at runtime
without having to specify it.
So then we want to
evaluate these things.
We don't want to be-- it's
nice to look at these pictures,
but we want to
measure, objectively,
how well the system's doing.
So we've proposed three criteria
for evaluating any model.
It doesn't have to be our model.
So we can't look
inside your head,
but we can ask you
to generate images.
And we're going
to evaluate those images.
So you generate a set, s.
And then we say, OK, are the
images you generated correct?
So that-- what
this says, simply,
is we're going to
apply classifiers
to your generated images.
And we're going to see
if the predicted labels
from our classifiers match
the things that I
told you to generate.
So if I say purple hippos, they
better all be hippos.
But I didn't say
if they're flying
or if they're all eating grass.
I don't care about that.
So you have to match on the
bits that you require, and you
don't care about the rest.
We also want coverage.
So we want to be--
the things that
I didn't specify,
I want you to give me a variety.
I want some flying hippos.
I want some flip--
hippos with wings.
I want some in the
water, some in the field.
So we want to measure that--
this is-- don't parse the syntax
too closely.
But that's the idea.
So we want diversity,
so we're going
to cover the extent
of the concept,
and not just give
me a single example.
And then we want to
handle compositionality,
which we do, simply by
partitioning the data so
that we get structurally
novel test sets.
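As crude proxies for the first two criteria (the exact definitions in the paper may differ, and the classifier interface here is an assumption):

```python
def correctness(samples, specified, classifiers):
    """Fraction of generated images whose predicted attributes match every
    attribute the query specified; unspecified attributes are ignored."""
    def ok(img):
        return all(classifiers[a](img) == v for a, v in specified.items())
    return sum(ok(img) for img in samples) / len(samples)


def coverage(samples, unspecified, classifiers):
    """How many distinct values show up for each attribute the query left
    open -- a rough check that the samples span the concept."""
    return {a: len({classifiers[a](img) for img in samples}) for a in unspecified}
```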
So then to test these--
we had a slightly
harder task that we
called MNIST with
attributes, where
we're sliding the digits
around, and we describe them--
you know, this is
the class label.
But we can say, this
is a small digit.
Is it upright?
Is it in the top right or a 4 in
the bottom left, and it's big?
So now we can fit a model on it.
And we can say, OK,
please generate me some--
a 4 which is big,
upright, and bottom left.
And there are some samples.
And these are some rival
methods that we do better than.
This is variational canonical
correlation analysis,
and it's blurrier than us.
And the classifier
doesn't like it.
It gets the bits
wrong-- shown in red.
This is the joint
multimodal VAE,
which is similar to us, but a
slightly different objective.
So our samples are
correct, more correct.
They're sharper.
This is a bit more interesting.
We can give partially
specified queries.
So I can say, just generate me
something on the bottom left.
I don't care what it is.
So sometimes it's 0.
Sometimes it's 3's
or 9's or 6's, right?
Or I might say, I want
it to be 3 and big,
but I don't care where,
and it slides it around.
But if I clamp all the bits,
then it's more specific.
And what's going
on under the hood
is that these Gaussians
I initially brought in
are shrinking as we
condition on more bits.
And that's inducing this
narrower distribution.
It's very similar to
Josh's thesis, where
he had distributions over
a hypothesis space, which
is either, like, the
number line or a tree.
And in our case,
it's a latent space,
which is nice because we
can fit any kind of data
to this latent space.
This is dynamically
changing as we condition
on more or less data.
And then, we can do
the compositional split.
So we can give it a query.
It's never seen anything
that's zeroes, bigs, uprights,
and top rights, and it
does the right thing.
It's seen zeroes on their
own and bigs on their own,
but it's not seen
this combination.
And we do better than others.
And we can quantify all that.
And you know, we beat
the other methods, too,
especially VCCA--
there's a healthy margin.
The other method we also beat,
but it's a smaller gap.
OK.
So that's very recent work.
I'm pretty excited about it.
There's clearly a long way
to go between, like, playing
with MNIST digits and
the kind of real data
that I was talking
about earlier.
And we need to bridge that gap.
So maybe I'll just mention
that future work is, basically,
to try to bridge that gap.
And furthermore, we want to move
away from just single images
and look at the active scenario
where we have streaming video,
and we're interacting with
people, ideally in real time,
and that raises a
whole host of issues
that I didn't talk about, which
is lots of juicy future work.
OK, thank you.
[AUDIENCE APPLAUSE]
