>> It's my real pleasure this
morning to introduce Chris Manning.
Chris is the Siebel Professor of
Machine Learning at Stanford
University where he has
a joint appointment in
computer science and
linguistics reflecting his broad
and multidisciplinary background.
He's also the Director of the
Stanford AI Lab and is one of
the folks on the advisory board of
a new human-centered AI
Initiative at Stanford.
So you can see the breadth of
his influence at Stanford
and well beyond that.
He's very very well known for
his work on statistical
natural language processing,
ranging from more
linguistic perspectives
on parsing and grammar induction,
to more machine learning perspectives
including his recent work on
GloVe and many other
techniques that leverage
large-scale data to solve
very interesting problems.
Chris is amazingly well known
in many different communities.
He has over 100,000 citations.
You can figure out how
many that is per waking
hour since he published
his first paper;
it's probably more than one.
But another part of
his influence I think,
goes above and beyond his research,
that spans language,
artificial intelligence,
machine learning,
information retrieval.
He's a fellow of all three of
the major societies
in computer science,
linguistics and AI,
but he's also contributed to
the field in many other ways.
He is a co-author of
two really important books,
one on the foundations
of statistical NLP and
another on modern
information retrieval.
But he's also, with his students,
developed many tools that are shared.
How many have used
the Stanford Parser
or the Stanford NLP tools?
He's shared code of all kinds,
he's shared representations,
things that arose out
of his work on GloVe.
So Chris, I think is a real scholar
in many senses of that word.
What he's going to do today is
talk to us about where we can go.
Right now if you look at
language and representations,
many of them focus on understanding
the relationships among words,
similarity structures in words,
but he's going to talk much
more broadly about how we
can bring notions of
memory, attention,
compositionality, and
reasoning to develop
systems that have
much more intelligence
than many of our current systems.
So Chris, without
further ado. Thanks.
>> Okay. Good morning and thank you
Sue for that very kind introduction.
I hope I can live up
to it and thanks to
everyone for coming to the talk.
Yes. So today what I want to
do is talk about some
of our recent work
on trying to build
neural networks that can
reason and this is joint work
with my student, Drew Hudson.
So our current neural network
machine-learning systems truly
excel on a variety of tasks such as
speech recognition
and computer vision.
Interestingly, they've even been
pushed with surprising success to
other tasks which you'd think
involve higher level
reasoning and understanding.
Notable success has
been the development of
neural machine translation which
has worked exceedingly well.
But in some sense,
even neural machine
translation has approached
things as a stimulus-response task:
you're provided
the input sentence,
produce the output, as a kind of
statistical association task.
So these tasks don't require
deliberate thinking or reasoning,
the kind of processes that
Daniel Kahneman has referred
to as thinking slow.
So what can we do about reasoning?
Reasoning is central to figuring
out a good approach when humans
are faced with a new problem, and in
developing longer-term plans
for a higher-level objective.
Can we use neural networks
to do the more
deliberate, conscious,
multi-step thought?
So that might be things like
middle-school style reading
comprehension problems,
which aren't really like what most of
the current reading comprehension
question-answering datasets are like,
but much more understanding,
a broad understanding of
the characters and what
they're doing in a novel.
Commonsense reasoning
and problem-solving,
working out plans,
playing a strategic game.
For tasks like this,
we really need better notions
of having knowledge,
reasoning, and inference.
So what then is reasoning?
How can reasoning, which has
traditionally been thought of
in connection with logics
and hand-built knowledge
representations,
be learned by the distributed
representation computation units
of deep learning?
So one inspiring viewpoint on this
was provided by Léon Bottou in 2011,
who suggested that the heart of
reasoning is having the means
for algebraically manipulating
previously acquired knowledge,
in order to answer new questions.
So Bottou suggested that
we could seek to enhance
deep learning systems with
reasoning capabilities,
right from the ground up.
He argued that reasoning is not
necessarily achieved by
making logical inferences,
that there is a continuity between
algebraically rich inference and
connecting together trainable
learning components,
and he in particular
emphasized that central to reasoning
are composition rules
that guide the combination of
modules to address new tasks.
So one of the things I want to think
about today is how can we start to
build composition rules for
multi-step reasoning
in neural networks?
So underlying this is some kind
of a conception as to where things
might head in deep learning.
So in some sense,
the dominant viewpoint in
the deep learning community is to
seek a learning device that
is as empty as possible,
some kind of tabula rasa.
So what's happened in a lot of
recent work is that the emptiness of
the machine learning
device is compensated
for by providing extra
information in the inputs.
So a very successful technique in
recent years has been
data augmentation.
So effectively, you are providing
lots of extra inputs:
in a vision system,
you rotate the images
a little, shave off
a few pixels, change
the coloration, and things like this,
and you're saying that
the model you want to
build should be invariant
to all of those things.
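A minimal sketch of that idea in Python, with made-up toy dimensions (this is just an illustration of label-preserving augmentation, not any particular system's pipeline):

```python
import numpy as np

def augment(image, rng):
    """Produce a label-preserving variant of an image (H x W x 3 array)."""
    out = image
    if rng.random() < 0.5:                       # random horizontal flip
        out = out[:, ::-1, :]
    dx, dy = rng.integers(0, 3, size=2)          # shave off a few pixels (random crop)
    out = out[dy:out.shape[0] - (2 - dy), dx:out.shape[1] - (2 - dx), :]
    out = np.clip(out * rng.uniform(0.9, 1.1), 0, 255)  # slight brightness jitter
    return out

rng = np.random.default_rng(0)
img = np.ones((32, 32, 3)) * 128
aug = augment(img, rng)
print(aug.shape)  # cropped by 2 pixels per dimension -> (30, 30, 3)
```

Each call yields a slightly different image with the same label, which is exactly the extra input the model is told to be invariant to.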
However, I believe that
we shouldn't be afraid of
good inductive biases of
thinking about how we
could design our
neural network systems,
so that they are built
so that they have
the ability to learn
quickly and well.
That probably having
some architectural constraints
is vital to how human beings
are able to learn so
quickly and well.
Indeed, I think the biggest
breakthroughs in deep learning,
have come from building
the right inductive biases,
structural priors into models.
Now, of course you can fail if you
build models with
too rigid a structure,
but we succeed by finding
appropriate but flexible
structural priors.
So these successes
include convolutions,
which have been central
to work in vision,
the notion of attention
which has been
very central to a lot of
the recent advances in
natural language processing
and other areas,
also the gating that you find in
LSTMs or highway networks is, again,
building a good structural
prior into your models.
In my early deep learning work,
I was a strong proponent of
the idea of tree-structured
compositional models.
I believe that that gave
the right inductive bias for
many human language tasks and
also some tasks in other domains.
So the model was that you
would take pairs of vectors,
which would initially be
vector representations of words,
and then you'd build from that
a vector representation of
a phrase like very good here,
and then you could continue applying
this same pairwise composition
operation recursively,
and you could build up representations
of phrases and sentences.
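A toy sketch of that pairwise composition in Python (illustrative only; the dimension, matrix, and word vectors here are made up, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # toy embedding size
W = rng.normal(0, 0.1, size=(d, 2 * d))  # shared composition matrix
b = np.zeros(d)

def compose(left, right):
    """Merge two child vectors into one parent vector of the same size."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# toy word vectors for "very" and "good"
very, good = rng.normal(size=d), rng.normal(size=d)
phrase = compose(very, good)             # representation of the phrase "very good"
# the same operation applies recursively, e.g. compose(phrase, another_phrase)
print(phrase.shape)  # (4,)
```

Because the output has the same size as each input, the operation can be applied recursively up a tree to build phrase and sentence representations.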
We were able to apply this model with
some success to various tasks
including sentiment analysis.
So here, the model could
start reading the sentence,
"There are slow and repetitive
parts," and start to build
up a composed structure
where it says, well,
that's a negative impression.
But "it has just enough space
to keep it interesting,"
and it builds up a
representation where
that part of the
sentence is positive.
The positivity wins
out so that overall,
this is a positive thing
to say about a movie.
These models aren't only
applicable to language,
you can actually apply
the same ideas in vision.
So visual scenes
also commonly have
a compositional structure, and
in these experiments that we
did on the Stanford Background
dataset, it is building
up a compositional
representation of a church
from the pieces of
the building and using that
to understand the visual scene.
Interestingly, what
Bottou suggested in
2011 as a model for reasoning
was essentially exactly
the same kind of tree structured
recursive neural network that
we started employing
for natural language.
He proposed that the path to
universal composition was that
you build an association module A,
in this picture from
his paper, that maps
two representations taken from memory
into a new representation
of the same sort; that
new representation can be
scored by a scoring module,
and simultaneously the new
representation can be put back into
the short-term memory
where it can be used
recursively to build up a proof tree.
However, for reasons
partly of computational efficiency
and partly of limitations
in flexibility,
it turns out that tree-structured
composition hasn't really
won out in neural network
research in the last five years,
but there are alternatives.
So one that I already mentioned
was this idea of attention.
If you look at it,
you can think of
attention as almost trees
by another name because
we're effectively putting
a soft tree structure over
the previous nodes to
generate a representation of
the next node, and we can apply that
recursively as we build
up along the sequence,
but we now have soft weights rather
than a rigid tree structure.
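A toy sketch of that soft weighting in Python (numbers are illustrative; this is just the generic attention-weighted average, not any specific model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, nodes):
    """Soft 'tree edge': weight every previous node by similarity to the
    query and return the weighted average as the next node's representation."""
    scores = nodes @ query        # one score per previous node
    weights = softmax(scores)     # soft selection instead of a hard tree edge
    return weights @ nodes, weights

rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 8))   # 5 previous nodes, dimension 8
query = rng.normal(size=8)
rep, w = attend(query, nodes)
print(rep.shape, round(float(w.sum()), 6))  # (8,) 1.0
```

The weights sum to one, so each new node is a convex combination of the previous ones: a soft tree rather than a rigid one.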
A second alternative for
building up something like
functions that take
multiple arguments is to think that
maybe our neural networks could
do what logicians call currying.
So rather than having
multi-argument functions,
like F that's taking
three arguments XYZ,
we could instead build
intermediate compositions.
So we can take one argument at a time
like X and build an
intermediate representation or
function which can then take
the next argument Y and proceed to
build up an overall representation.
It seems reasonable to assume that
neural sequence models could
do this kind of computation.
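The currying idea can be sketched directly in Python (a hypothetical three-argument function, purely for illustration):

```python
def f(x, y, z):
    """A hypothetical three-argument function."""
    return x + 2 * y + 3 * z

def curry(fn):
    """Turn a 3-argument function into a chain of 1-argument functions."""
    return lambda x: lambda y: lambda z: fn(x, y, z)

g = curry(f)
step1 = g(1)      # intermediate 'representation' after consuming x
step2 = step1(2)  # ... after consuming y
print(step2(3))   # same as f(1, 2, 3) -> 14
```

Each intermediate closure plays the role of the intermediate representation the talk describes: it has absorbed one argument and stands ready to take the next.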
So in this current line of work,
the hope is that we
can start exploring
neural network architectures
so that rather than
having big fairly unstructured
neural networks that act as
association engines or
correlation engines
that look just for any kind
of pattern in the input,
we could use a model that is more
structured with the prior that
encourages compositional and
transparent multi-step reasoning.
However, we'd like to do this in
models that are still
practically usable.
So for the models that we've built,
we've focused on models that are
still end-to-end differentiable.
There's a ton of work now on building
reinforcement learning models
in which this is not true,
but I still see
the simplicity of end-to-end
differentiable models
as so much easier to
train and work with.
Also, it provides a space
where it's easy to build
models that are still scalable
to reasonably large problems.
So in the work today,
I'm going to concentrate
on showing results
from the area of
visual question answering.
So here we are shown a picture
and we're asked
questions about it and
the suggestion is
that asking questions
is a good way to
assess understanding.
This was a viewpoint that was
put forward very early on.
So long ago, there used to
be the Yale AI school.
Who here is old enough to have
heard of the Yale AI school?
Yeah. That's right.
Two or three people.
So one of the members of
the Yale AI school was
Wendy Lehnert who worked on
question-answering and she writes,
"When a person understands the story,
they can demonstrate
their understanding by
answering questions about the story.
Since questions can be devised to
query any aspect of
text comprehension,
the ability to answer questions is
the strongest possible demonstration
of understanding."
The same is true for visual scenes.
The reason I've been interested in
visual question answering
is that it
just seemed like
a better proving ground
for compositional reasoning,
although there has been a ton
of work, which I've also been
involved in, on textual
question answering systems.
In general, the textual
question-answering work has still been
dominated by lexical semantic
matching and there just hasn't
been very much opportunity to do
multi-step iterative reasoning
in that domain, whereas it
seemed like visual
question answering gave
a nice proving ground
for looking at multi-step
compositional reasoning.
Okay. So that's my intro on
trying to go from machine
learning to machine reasoning.
So for the rest of the talk today,
I first of all I want
to tell you about
our initial work on MAC networks
on the CLEVR task,
then tell you a little bit about
a new dataset, GQA, that we've
developed more recently for
visual question answering,
and then tell you
about a more fresh-off-the-press
neural state machine model
for doing
visual question answering.
So let's just start off
with the CLEVR data set.
So the CLEVR dataset came
out of considerations of
visual question answering.
So the belief of some of the people
working in visual question
answering was that the most-used
visual question answering
task, which came
from Devi Parikh's group at
Georgia Tech,
had led to a lot of research
on language and vision,
but it hadn't really been
a very good testing
ground for actually
doing scene understanding
and compositional reasoning.
So Justin Johnson and colleagues at
Facebook AI Research decided
that they should try and come
up with a diagnostic dataset
that especially focused on
compositional language and
visual reasoning about scenes.
So they went back to
that old classic of
AI, the blocks world, and
synthesized blocks-world scenes
in Blender, and then they
asked long compositional questions
about the scenes.
So in this one:
there are some purple cubes,
but this one is behind
a metal object.
So there's the metal object,
so it seems that it has
to be this purple cube,
the one that's left of a large ball.
Well, there's a large ball
and it's left of it.
That seems hopeful.
What material is the cube?
If you haven't seen
this dataset before,
you'd probably say no idea,
but it turns out in this dataset,
all things are made
out of two materials,
they're either metal or rubber.
So if they're not shiny,
the right answer is rubber.
So a number of people have worked on
systems to approach this data set.
But as well as the scene
and the question and answer,
another attribute of
this dataset, which
is a reflection of
how it was constructed,
is that on the right-hand side
there's a formal representation
of a functional program.
So the way that this dataset
was constructed was
that they were building
functional programs that
they could run on
the visual scenes in an abstracted
form and get out of them
the answer, and then
that functional program was
converted into natural language.
So one of the questions on using
this dataset, which will
come up again later, is:
are you just building a system over
just textual questions and answers
and images, or are you making use of
these functional programs
as added supervision?
So one example
piece of work that has
made use of the functional
programs is
a line of work by Jacob Andreas
and then by Justin Johnson
himself that has explored
neural module networks.
So these are partially
differentiable models
that try to approach
the problem in a compositional way,
but they rely on the strong supervision
of the functional programs to
translate queries into
tree-structured functional programs.
So the first part of the model is
an LSTM encoder-decoder model that reads
the textual question and
produces the functional program.
Then the second part of that is
another neural network which
interprets this functional program.
So that neural network
is built out of
custom building blocks for
these different semantic operations.
So there's a counting
neural network and
a comparing neural network and
a filtering neural network, and they're
plugged together in
a compositional way; they
interpret this functional program
and then aim to give
you the answer.
But what I want to
do first today is to
introduce our version of approaching
this problem which
is the MAC Network,
which stands for Memory,
Attention, Composition, as
a neural model for problem-solving
and reasoning tasks.
So we want to have
this idea of building
a network with structure that
encourages it to do
multi-step reasoning.
So it should decompose
a problem into a sequence of
explicit reasoning steps and each of
them corresponds to
one MAC cell in our Network.
But on the other hand,
the neural module network
approach that I just showed
on the previous slide
didn't seem right to us:
it seemed much too
bespoke, by the time you're
custom-designing individual
network units for comparison,
filtering, counting,
and things like that.
That didn't seem sufficiently
generic as a model of intelligence.
So what we wanted was
one universal MAC cell that could
be used for everything,
one that's versatile enough
that it can learn to
do different things depending on
the context in which it's applied.
What we're building is
a recurrent neural network,
but one that is able to
attend backwards throughout
a sequence of reasoning steps.
So through attention, it can in a soft
way represent
a complex reasoning graph,
in a model that still has
end-to-end differentiability.
So each MAC cell is one step
of a reasoning system.
So the design we have is
more articulated than that of
most standard recurrent
neural networks,
since the model retains
two recurrent states.
So it has a control
state which is used
to describe the reasoning
operation of the network,
and the control state is
an attention-based average
of a given query,
which in our case is
just the textual question.
Then there is a memory
state, which is based on
information that in general we'd
say is being collected
from a knowledge base,
but in the particular
application here,
the knowledge base is simply
the image that the system
is looking at.
So the memory is going
to be represented as
an attention-based average over
our image, or over our knowledge base.
So there are a couple of things
that are worth noting here.
One is that in our model,
we're not going to make use of
the strong supervision of
the functional programs at all,
we're simply working
from the question,
the picture and attempting
to get the answer.
The other one is this design choice
that we're representing
thinking in terms of attention,
so attention-based averages
over a given query and
attention-based averages
over the image.
The second model I'll
show later does things
a bit differently and
has more abstraction,
we'll get to that later,
but it also represents everything
as attention-based averages.
I can't prove anything here,
this is just a bet,
but it seemed to us that the use of
attention-based models of this sort
has proved to be very successful.
An immediate good property
of these models
is that attention gives
you easy interpretability.
You can say what words is
the model looking at and
what part of the picture is the
model looking at, but beyond that,
it seems that grounding models in
the space of attention-based
averages appears to be
a useful way to somehow constrain
and direct the models and
get them to learn more
effectively than if we
just had unconstrained
hidden states.
In slightly more detail,
so here is our MAC cell.
So on the top part of the MAC cell,
we have the control unit which
computes a control state.
So it takes in the question text
and the previous control
state and it will generate
a new control state by focusing
on some aspect of the query.
Then down at the bottom,
we have the memory.
So this part takes in
the preceding memory hidden state
and our so-called
knowledge base, here
the image, and based on
the control information and
the previous memory state,
it reads some information out
of the knowledge base or image.
That generates
a new candidate memory,
which is combined with
preceding memories to generate
a new memory, which is then
written into the next memory state,
merging old and new information.
So the bet here is,
at the time we first
started this work,
there had been some quite
prominent work from DeepMind
on Neural Turing Machines and
then Differentiable Neural Computers.
It seemed that although in theory
those models were very powerful,
it seemed like in practice they were
very difficult to
control and indeed I
think to this day they
haven't been demonstrated on
any larger scale problems.
Part of the difficulty comes from
the Turing-machine-style
arbitrariness:
you can read and write
anywhere, and that makes it
very difficult to learn
and control these models.
Whereas this model has
a very simple
organization, since you're
sequentially laying down memories,
but each next memory can be computed
based on attention over
previous memories.
Let's see. Time goes by fast,
so I probably should do this quickly
so I get onto the later ones.
Very quickly, more
details in the paper.
So the control part
takes the previous control
state and the query
and computes a representation;
it then uses attention onto
the words of the... actually,
I should explain that.
The words of the question are
run through a bidirectional LSTM,
so they're contextual
word representations.
It then uses what
it's computed up
here to put attention
over the words of
the query, in the standard attention
distribution way, to produce
a weighted average over the words of
the query, normally
focusing on certain words,
and that gives the next
timestep's control signal.
The read unit takes in
the previous memory state
and the knowledge base,
the picture, and it
wants to get something
out of the image.
The early part of it allows
the previous memory to interact with
the image and get stuff out
of it in an associative way,
and then it feeds in the
control state and does
a second round of using that to
put attention over the image.
Again, we have an attention-weighted
retrieval from the image,
and that creates
the new candidate memory state.
The new candidate memory state
then goes into the write unit,
where the new candidate
memory state,
itself, is combined with
past memory states
using the control state.
So the control is used to decide
which past memory states to
pay attention to, in
a key-value attention mechanism.
That is then used to
calculate a weighted distribution
over previous memories and
the new memory, and that
gives you the new memory state.
The hope is that we can
simulate in that way
a soft but arbitrary DAG of
reasoning by successively
writing new memory states.
So that's one MAC cell.
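The control/read/write loop just described can be sketched in toy Python (a deliberate simplification with made-up dimensions and a scalar gate; the actual MAC cell's parameterization differs):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # toy hidden dimension

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mac_cell(control, memory, question_words, kb_regions):
    """One simplified MAC step: control attends over the question words,
    the read step attends over image regions, the write step merges memories."""
    # Control unit: attend over question words to pick this step's operation.
    new_control = softmax(question_words @ control) @ question_words

    # Read unit: memory and control interact with the knowledge base (image).
    retrieved = softmax(kb_regions @ (new_control * memory)) @ kb_regions

    # Write unit: a scalar gate merges the old memory with the new information.
    gate = 1.0 / (1.0 + np.exp(-(new_control @ memory)))
    new_memory = gate * memory + (1.0 - gate) * retrieved
    return new_control, new_memory

words = rng.normal(size=(7, d))     # contextual question-word vectors
regions = rng.normal(size=(10, d))  # image-region features (the knowledge base)
control, memory = np.zeros(d), np.zeros(d)
for _ in range(4):                  # four reasoning steps
    control, memory = mac_cell(control, memory, words, regions)
print(control.shape, memory.shape)  # (6,) (6,)
```

The point of the sketch is the shape of the computation: two recurrent states, each updated only through attention-based averages over the question and the knowledge base.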
Then we build
a MAC Network by building
a recurrent sequence model that
runs through a bunch of those cells,
and that gives us a model
that's efficient,
easy to deploy, and still
fully differentiable.
But it has the capacity to represent
arbitrarily complex reasoning via
directed acyclic graphs in
a soft way through attention.
So let me present
the results of this,
initially on
the CLEVR dataset.
So it has 700,000 training
examples and 150,000 test examples.
The space of answers is very small:
there's metal, rubber, cube,
sphere, a few colors,
and very small numbers:
zero, one, two, three, four,
five. So the baseline is quite high.
If you answer with
the most frequent answer by
question type, you're already
at almost 42 percent.
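That most-frequent-answer-per-question-type baseline is easy to sketch (toy data here is invented for illustration, not the real CLEVR distribution):

```python
from collections import Counter, defaultdict

# Hypothetical toy (question_type, answer) pairs.
train = [("color", "red"), ("color", "red"), ("color", "blue"),
         ("material", "rubber"), ("material", "rubber"), ("material", "metal")]
test_qs = [("color", "red"), ("color", "blue"),
           ("material", "rubber"), ("material", "metal")]

# Most frequent answer per question type, from the training set.
by_type = defaultdict(Counter)
for qtype, ans in train:
    by_type[qtype][ans] += 1
majority = {qtype: c.most_common(1)[0][0] for qtype, c in by_type.items()}

correct = sum(majority[qtype] == ans for qtype, ans in test_qs)
print(correct / len(test_qs))  # 0.5 on this toy split
```

On CLEVR's small answer space, the same recipe already reaches about 42 percent, which is why raw accuracy numbers have to be read against that floor.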
But what I wanted to show is
that the kind of architecture
that had been used
for visual question answering,
sort of CNN stacks plus
LSTMs, didn't really work on
this data, because they didn't
do the necessary reasoning.
So a kind of state-of-the-art
VQA system at the time,
a couple of years ago,
only got 52 percent.
Compared to that, neural
module networks,
this is Jacob Andreas's work,
did a ton better and got 83 percent,
which was not too far
off the 92.6 percent that was
reported as human accuracy.
But actually, these problems
were synthesized,
so they are synthesized
images and synthesized
answers based on
the functional programs.
So there's absolutely
no reason why a system
shouldn't be able to get
100 percent on this data,
and there's actually a fairly
artificial reason why
human performance was
depressed from 100 percent,
not just human laziness.
So what's actually happened
in more recent work is that all of
the action moved to
the 95-to-100-percent space:
initially
DeepMind came up with
a relation network model
that got to 95.4 percent,
then Justin Johnson did
a new generation of
neural module networks that
got to almost 97 percent,
and Montreal with the FiLM network
got to 97.7 percent.
Relation networks and
FiLM are essentially
both large CNN stacks interleaved
with specialized layers.
So relation nets had a relation
net layer where every pair of
pixels has relationships assessed
between them, to understand
binary relations.
FiLM inserts these conditional
linear modulation layers
that tilt the activations
based on the question.
It's harder to get an intuitive
sense of what that model does.
But anyway, our MAC Network
works super well.
It gets 98.9 percent, which
essentially halves
the remaining error.
But you might be starting to
wonder how meaningful this is when
things are getting so
close to 100 percent.
So something that I think is
interesting is the following.
If we look at
learning curves based on
how much training data you
give to the models, and
instead of giving
the model 700,000
examples you only give it
70,000 training examples,
then what you find is
that other models, such as
the FiLM model or
the neural module network model
of Justin Johnson,
can't actually learn very
much from 70,000 examples at all.
Even though 70,000 is a lot,
this isn't a very big space;
you'd think you should be able
to learn something from
your 70,000 examples.
They don't get that far above
baseline performance on
the most common classes,
whereas our MAC network model
is already getting
86 percent accuracy
with 70,000 examples.
So I think that's actually an
interesting proof that the design of
the model has the right priors to be
solving these multi-step
inference problems.
In subsequent work,
Justin Johnson also built
a CLEVR-Humans dataset,
where he collected 18,000 real
questions from humans through
crowdsourcing, where
they're told to write
questions that are hard for
a smart robot to answer.
So these questions have
a more diverse vocabulary
and different reasoning
skills which might not be
in the original dataset,
and there's a small training
set for fine-tuning.
So if you run models trained on
the original data zero-shot on
this CLEVR-Humans dataset, you get
these results, where none of
the systems work great,
but the MAC network does do
a little bit better.
If you fine-tune on the training
set, you get these results,
where again, the MAC network
is able to get
more value from the fine-tuning set
than any of the other models,
I think reflecting its ability
to generalize well from
small amounts of data, and it starts
to get quite good performance.
Here's a couple of examples that
show how we get
the benefits of having
interpretable reasoning from having
the attention over
both the sentence and our image.
So: "What color is the matte
thing to the right
of the sphere in front
of the tiny blue block?"
So this is a very short example
where there are
only three reasoning steps.
So in the first reasoning step it
focuses on "tiny blue block"
in the language,
and looks at the tiny blue block.
In this second reasoning step
it's the sphere in
front and it's then focusing in
the image on the sphere in front.
Then the third reasoning step
is color is the matte thing and
it's going to the matte thing and
asking its color and
correctly answers purple.
You get better results with
a longer sequence of reasoning steps.
So here it's using
six reasoning steps on this example.
How many objects are either
small objects behind
this tiny metal cylinder or
metallic cubes and front of
the large green metal object?
Often what you see here
is that in the first couple
of timesteps it more sort of looks at
the macro structure of
what it's being asked
before it starts to focus visually.
So initially it's
looking at the "how many",
the question type, and
well, "either small",
I'm not sure what it's doing there.
But then in the second step
it's realizing it's
a disjunction between two things,
and then in the third timestep it
starts focusing on particular things.
So "large green metal":
it's looking at the
large green metal object.
"Metallic cubes": it's
looking at the metallic cubes.
"Tiny metal cylinder":
and it's looking at
the small objects behind it.
At any rate, it's not quite
clear how it does the counting
but it does in this example
correctly have the answer of four.
So I guess it's managed to
somehow get that out of
what's going on.
So that was an initial neural
compositional reasoning engine.
So I hope I've shown that
it seems like we've gotten
some good value from having
a constrained sequence model with
good priors: separated-out control
and memory, and exploiting attention.
But let me move on and
get to the newest stuff.
So one problem is that
although there were
these early reasoning datasets,
including CLEVR, they seem limited.
There are artificial images
and/or language.
There's a very small space of
possible objects and attributes.
Although in theory you're
doing compositional reasoning,
the feeling is that
because the amount of data
is so large and the space is so
small, the suspicion is that
these models actually
just memorize "molecules":
they just learn what a red metal
sphere is, and therefore they have
a lot less compositionality in
them than you'd hope they have.
On the other hand,
the main visual question
answering benchmarks have also
seemed somewhat problematic.
So there are strong, what are often
referred to as, language biases,
but I think they're mainly actually
more real-world biases, and
models guess based on priors:
snow covers the ground,
grass is green, and things like that.
There are also visual biases toward
overly salient objects, and
it's hard to tell, when
systems are going wrong,
what exactly is causing it;
and really, the questions
are too simple.
The questions are often simply,
what color is the grass?
So little reasoning or
compositionality is required.
So we've worked to produce
this new dataset for
compositional question answering
over real-world images, called
GQA, and in some sense this is
CLEVR done on a larger scale.
So we start with
real images, which come from
the same MS COCO sources
that everybody uses in
the vision world, and then generate
compositional
questions a bit like CLEVR;
we generate
10 million compositional questions
overall, and then generate
a balanced, smaller set of questions
with closely controlled
answer distributions.
The way we do that
is that we make use
of, and provide, a scene graph
with each image that
represents its semantics, and
then the questions,
like in CLEVR, come
with a functional program grounded in
the scene graph that
shows their semantics.
Perhaps surprisingly the questions we
build are generated using
a traditional rule-based
question engine.
But this means we get
just the precise semantics
that a rule-based system can
give you: we're just
turning the scene
graph representation,
in a very controlled manner, into
natural language questions.
So it's still a controlled language,
but we try to give much
more in the way of
linguistic diversity
and a large vocabulary.
Since we completely understand
the scenes, we can have
metrics that assess
the consistency of models' answers,
and various other metrics.
So in slightly more detail,
our starting-off point is
the Visual Genome dataset that
my Stanford colleagues Ranjay
Krishna, Michael Bernstein,
Fei-Fei Li, and others developed,
where you start off
with the MS COCO pictures.
And then they put over this
a scene graph representation,
shown at right, where there
are identified objects,
which are the pink things
like helmet, watch, and man.
These objects can have
attributes, like silver helmet
or black watch, and then they can
be connected together
by relationships.
So the man is wearing a helmet or
the cow is kneeling on the grass.
We produced an improved
Visual Genome.
So rather than just bounding boxes,
we made use of the last few years of
computer vision technology,
and now have masks for the images.
But more importantly,
the original scene graphs
are completely unconstrained in
their natural language
labeling of things.
So we move to a clear ontology of
concepts by resolving synonyms,
discarding some very
unclear or rare things.
So our ontology has 1,700
objects, 600 attributes,
and 330 relations which are grouped
into 60 categories and subcategories.
We also augment the graphs with
some additional information,
we put in positional
relation information,
some comparative information
of same or different color,
and some other global
information that seems
useful but tended not to be
in the scene graphs:
things about the weather
and things like that.
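To make this concrete, here is a toy sketch of a scene graph with objects, attributes, relations, and augmented global information. The dictionary layout, IDs, and helper function are my own invention for exposition, not the actual Visual Genome/GQA format.

```python
# Hypothetical scene-graph representation like the one described above
# (object names, attributes, and relations are illustrative only).

scene_graph = {
    "objects": {
        "o1": {"name": "man", "attributes": []},
        "o2": {"name": "helmet", "attributes": ["silver"]},
        "o3": {"name": "cow", "attributes": []},
        "o4": {"name": "grass", "attributes": ["green"]},
    },
    "relations": [
        ("o1", "wearing", "o2"),       # the man is wearing a helmet
        ("o3", "kneeling on", "o4"),   # the cow is kneeling on the grass
    ],
    "global": {"weather": "sunny"},    # augmented global information
}

def objects_related_to(graph, subject_id, relation):
    """Return the names of objects linked to `subject_id` by `relation`."""
    return [
        graph["objects"][obj]["name"]
        for subj, rel, obj in graph["relations"]
        if subj == subject_id and rel == relation
    ]

print(objects_related_to(scene_graph, "o1", "wearing"))  # ['helmet']
```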
So then, once we have
the scene graphs,
we generate natural language
questions using
our rule-based engine.
This is a conventional
rule-based Natural
Language generation system
with probabilistic rules,
so that we can have
common and uncommon patterns, with
a standard context-free
style grammar
that can build up descriptions.
There's then quite a lot of
work behind the scenes
that tries to make
these questions good questions.
So we want to make sure
they're answerable,
they're uniquely answerable, and that
they seem reasonably natural.
So there are actually
a lot of large, somewhat induced,
somewhat hand-built
lexica that underlie
which natural
words to use to
describe various kinds of
relationships that might
be more abstractly expressed
in the scene graph.
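As a rough illustration of this kind of probabilistic, template-based generation: the templates, weights, and helper below are invented for exposition, not the actual GQA question engine.

```python
import random

# Toy sketch of probabilistic template-based question generation over a
# scene-graph object, in the spirit described above (templates and
# weights are invented for illustration).

templates = [
    # (weight, template) — weights give common and uncommon patterns
    (0.6, "What color is the {name}?"),
    (0.3, "Is the {name} {attribute}?"),
    (0.1, "Is there a {name} in the picture?"),
]

def generate_question(obj, rng):
    """Sample a template by weight and fill it from the object's slots."""
    weights = [w for w, _ in templates]
    template = rng.choices([t for _, t in templates], weights=weights)[0]
    return template.format(name=obj["name"], attribute=obj["attributes"][0])

rng = random.Random(0)
obj = {"name": "helmet", "attributes": ["silver"]}
print(generate_question(obj, rng))
```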
This gives you an idea of
the differences between VQA and GQA.
So visual question answering
has crowdsourced questions and
crowdsourced answers in
unrestricted Natural Language.
So the VQA questions are
real human questions, whereas in GQA,
the questions are more artificial,
because we are generating
them from our grammar.
On the other hand, the GQA questions,
we believe, are a better test bed for
exploring scene
understanding and reasoning.
Because the VQA questions
are sort of random.
Does this man need a haircut?
What is different about
the man's shirt that
shows this is for a special occasion?
These rely on a lot of
world knowledge, so you can't really
answer them from the question alone.
Our questions are
more straightforward,
scene-understanding
questions which are
perhaps more artificial sounding.
Is there a necktie in
the picture that's not red?
But at least we're thinking that
this is probably a better testbed
for exploring
visual understanding and
compositional reasoning
than VQA has provided.
This isn't a "one is better" claim;
it's a "better for a certain purpose"
kind of answer.
Here are the baseline accuracies
that we got with this new dataset.
So global prior is if
you're just saying,
if you're asking a color question,
you give the most
common color answer,
you get whatever about 17 percent.
This is a vision only model.
So there's been a tradition in
VQA of seeing how far you can
get with a vision only model
or a language only model.
The local prior is prior
for a particular object.
So this is, if you're asking,
what color is the apple,
you do rather better
than the global prior.
The LSTM is the language only model,
and this is the result
that's always been shown
in current VQA systems
that you can do not
too badly by ignoring
the picture altogether and
just answering based on
the language that you're given.
LSTM+CNN is sort of
the standard baseline VQA system.
Bottom-up is the recent
winning VQA system
of Peter Anderson and
colleagues that uses
bottom-up and top-down attention.
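As a sketch of what the two prior baselines compute, the tiny dataset and function names below are invented for illustration.

```python
from collections import Counter, defaultdict

# Sketch of the two prior baselines described above: a global prior per
# question type, and a local prior per (question type, object) pair.
# The tiny training set is invented for illustration.

train = [
    # (question_type, object, answer)
    ("color", "apple", "red"),
    ("color", "apple", "red"),
    ("color", "grass", "green"),
    ("color", "sky", "blue"),
]

def global_prior(data):
    """Most common answer for each question type, ignoring the image."""
    by_type = defaultdict(Counter)
    for qtype, _, ans in data:
        by_type[qtype][ans] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in by_type.items()}

def local_prior(data):
    """Most common answer for each (question type, object) pair."""
    by_key = defaultdict(Counter)
    for qtype, obj, ans in data:
        by_key[(qtype, obj)][ans] += 1
    return {key: c.most_common(1)[0][0] for key, c in by_key.items()}

print(global_prior(train)["color"])            # 'red'
print(local_prior(train)[("color", "grass")])  # 'green'
```

The local prior does better than the global prior exactly as described: once you know the question is about the grass, "green" beats the globally most common color.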
If you look really closely,
you'll notice that our MAC model is
better than any of
these other models,
but somewhat disappointingly,
the MAC model doesn't actually
work that well on this dataset,
which was trying to prove the point
about more abstract reasoning,
and it's way below
humans, who come
in at about 98 percent accuracy.
So in the last bit of the talk,
I then want to tell you a bit
about a new model that we've been
exploring that tries to do
somewhat better than this.
So the idea that we've been
exploring in this new model is:
for visual question
answering, well,
questions and answers are
clearly about language.
But maybe we could actually make
more progress on visual question
answering if, rather
than having these systems
which have vision on
one side and language
on the other side,
we instead did
everything in terms of
an internal language of thought.
So what we'd like to do is have
a common conceptual representation
which we can use to represent
both language expressions
and visual scenes in,
and then we'll be able to reason
using that and we'll
be able to do better.
So I am no expert on
human visual perception,
but nevertheless, my
superficial understanding
of human visual perception
is that the idea that brains
have a photo
that they're using
isn't supported by most of what's
known about human visual perception;
instead, the eyes are making
momentary fixations,
and what they're getting
out of each fixation is
some very high level
scene gist, right?
So you do a fixation on
the man and you see there's
a man who looks like
a cyclist with glasses,
helmet, gloves, watch, and
you're aware of the fact
there's a grassland scene with
a cow somewhere off to the right.
But you don't actually get a lot of
detail out of that, and
then you sort of
make another fixation, and
maybe you look at the cow,
and then you'll notice it has
horns and a bell on the front
or something like that.
But it's really this
fairly abstracted scene gist
that's actually in the brain as
you do this visual processing.
So in the same way,
our hope is that we
can use concepts to
organize our visual
sensory experience,
and we can build from
those an abstraction,
a world model to represent
what we're seeing in
our environment and
our world models will essentially
be the scene graphs
that I just showed you.
We'll be able to use
those to draw inferences.
So this fits in with
the general deep learning story.
So the hope of deep learning
models is that we could
use the depth to learn
higher level abstractions.
The reason why that
should be useful is
that for surface signals
like visual signals,
there's all sorts of
complications and variations.
But if we can build
higher-level abstractions,
we should be able to disentangle
our representations and that will
allow us to improve generalization.
So in particular, the way we
want to do that is by building
a model that does content-based
attention over concepts.
So attention is
our central operation,
again, that we're using here.
So we're going to have a large set of
concepts and we're going to put
attention over a few concepts.
Crucially what we're going to
explore here is previously,
and in most other vision systems,
you're putting
attention directly over
the image, over pixel space.
But here we're arguing that
maybe if, instead, we put
our visual attention over
concept space, we'll
be able to do rather
better, where concept space
is our language of thought.
So this is related to what
Yoshua Bengio has proposed as
his so-called Consciousness Prior.
I'm not sure I like that name,
but I think the idea
is basically right.
So the general research program
is to say that we should try and
learn deep representations that
disentangle abstract
explanatory factors,
and then his suggestion is that
conscious state is a very
low-dimensional vector which is
an attention mechanism over
the disentangled deep representation.
The way we're doing
things is somewhat
different from what he proposed.
But in some sense,
it's the same idea of
a disentangled concept space
with attention over it.
So this is our model,
the neural state machine.
So we have a vocabulary of
embedded concepts which are
our atomic semantic units,
and that is that cleaned
up visual genome ontology
that I told you about earlier.
But this time,
we're going to translate
both questions and images into
these concepts so that they
both speak the same language.
So everything is going
to be represented as
attention over the concept vocabulary.
So we hope that this
will give us a model of
concept learning and use somewhat
similar to what humans might do.
So a Neural State Machine is
a differentiable graph-based model
which simulates a state machine.
So in some sense, we're hoping
to see it combining some of
the strengths of both neural and
traditional symbolic approaches.
So this is what we do.
We've got two stages:
construction and inference.
So in the construction stage,
we're going to take image and
turn it into a scene graph.
So the way we're doing this is
following other work on
image-to-scene-graph generation,
and again including work by
Ranjay Krishna and others.
At Stanford, we're using
similar methods to generate
scene graphs from images.
We do object recognition and
then have further components
that infer
attributes of objects and
relations between objects.
Simultaneously, we run an
encoder-decoder style model over
the natural language question
and turn it into a sequence of
instructions, which are
attentions over concepts
in our abstracted concept space.
Then we're going to do
inference which is going to be
a state machine style computation
over our graph to compute an answer.
So formally, we have
something that looks like
a finite state machine
with states and edges.
Then there's a sequence
of instructions that are
going to be like an input,
but we're going to do things
in a probabilistic manner.
So we start with the probability
distribution over initial states.
Then we have a neural
state-transition function
which computes probabilities of
next states based on
the current state distribution
and what the instruction is.
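A minimal sketch of this probabilistic state-machine update, with the neural transition function replaced by a fixed lookup table purely for illustration (the states, relation, and probabilities are invented):

```python
# One soft transition: a distribution over states is pushed through a
# transition function conditioned on the current instruction.
# p'[j] is proportional to sum_i p[i] * T[i][instruction][j].

def step(state_probs, transitions, instruction):
    """Update the distribution over states given one instruction."""
    next_probs = {s: 0.0 for s in state_probs}
    for i, p in state_probs.items():
        for j, t in transitions.get(i, {}).get(instruction, {}).items():
            next_probs[j] += p * t
    total = sum(next_probs.values()) or 1.0
    return {s: p / total for s, p in next_probs.items()}  # renormalize

# Three states; the instruction "right" moves attention from A mostly to B.
transitions = {
    "A": {"right": {"B": 0.9, "C": 0.1}},
    "B": {"right": {"C": 1.0}},
}
probs = {"A": 1.0, "B": 0.0, "C": 0.0}
probs = step(probs, transitions, "right")
print(probs)  # attention is now mostly on B
```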
So this is it all pictorially.
So from an image,
we construct a scene graph
which happens to also look
like a state machine,
where the states correspond to
objects and the
transitions correspond to
relations and the states
have these properties,
and all of this is represented in
a soft way as attention over
concepts in our concept ontology.
So effectively, these attentions
over concepts give us
a disentangled
representation in terms of
concepts and their properties
and relations between them.
So it's all factorized using
our concept vocabulary.
The question is then also
translated into a series
of instructions.
So each instruction is again
an attention distribution
over concepts.
But if we just go with
the argmax of it,
"what is the red fruit
inside of the bowl to
the right of the coffee maker?"
gets interpreted
as the instruction sequence:
coffee maker, right, bowl, inside, red.
So then we can run these instructions
on our state machine and say well,
start with the coffee maker,
you want to look to the right,
there should be a bowl there.
You want to look inside that,
and there's something red,
and the network will then be
able to say "Yeah that's an
apple" and will then be able to
answer the question correctly.
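A toy, hard-argmax version of executing such an instruction sequence over a scene graph might look like this; the graph, helper, and instruction handling are invented for illustration, and the real model uses soft attention throughout rather than discrete steps.

```python
# Toy symbolic execution of an instruction sequence over a scene graph.
# Each instruction is either an object name, a relation to follow, or an
# attribute to confirm (all data here is invented for illustration).

graph = {
    "coffee maker": {"right": "bowl"},
    "bowl": {"inside": "apple"},
}
attributes = {"apple": {"red"}, "bowl": {"white"}}

def execute(instructions, start):
    node = start
    for instr in instructions:
        if instr in graph.get(node, {}):            # relation: follow an edge
            node = graph[node][instr]
        elif instr == node:                         # name: confirm the object
            continue
        elif instr in attributes.get(node, set()):  # attribute: confirm it holds
            continue
        else:
            return None                             # instruction cannot be grounded
    return node

# "What is the red fruit inside the bowl to the right of the coffee maker?"
answer = execute(["right", "bowl", "inside", "red"], start="coffee maker")
print(answer)  # 'apple'
```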
Here's one more example of that.
So, what is the tall object
to the left of the bed made of?
So the model will form the instruction
sequence: bed, left, tall, made.
So it'll first of all
be looking at the bed,
looking to the left of the bed,
confirming there's
something tall there,
then asking what it is made of,
and will then return the answer.
But I'm just doing this as
one-hot language in describing it.
Really at each stage,
this is done as
soft attention distributions
and at all times there's
a soft activation level over
the entirety of the scene graph.
Now, so here's our new model.
So this is over GQA again.
So these were all the models I
already talked about and MAC.
So we are actually getting
a nice lift there from
the neural state
machine model, which is
now working quite a bit better.
Fine point: you'll notice MAC
doing a bit better
than before as well.
Yeah it turns out if you tune
these models for longer,
you will work out ways
to do even a bit better.
But nevertheless,
the neural state machine is
doing well better than that.
I hope that that's showing
that having reduced both modalities
to the language of concepts
gives you some power.
Here are a couple more demonstrations.
So there's another interesting
dataset that was done by
Agrawal and colleagues from
Georgia Tech, where
they wanted to give
a better test of
visual question answering
than standard VQA datasets.
So they produced
this VQA-CP dataset where
deliberately the distribution was
changed between the training
set and the test set.
So for example, in
the training set, for
"what sport is that?" questions,
the two most common answers
were tennis and
baseball, but they changed the
distributions on the test set.
The most common sports
shown being skiing,
then baseball, soccer,
skateboarding, etc.
So you have a different distribution.
So if you're actually
understanding concepts,
you should be able to still
get the answer right but if
you're just guessing
based on world priors,
you'll get the answers badly wrong.
So what they showed was most
current models, you can see here.
Here are the neural module networks
that we talked about
before and again,
and various other models
at that point.
It seemed like well no actually
they weren't understanding concepts.
So for many of these models, their
performance essentially
halved when you went
to the VQA-CP dataset.
So stacked attention networks
perform terribly on this dataset.
So Agrawal et al. produced
their own model that worked better.
The interesting thing was it was
actually also a model that mapped
both images and language to concepts.
So they had that idea as well.
Then here are all the 2019 models on
arXiv, which are doing rather
better than their model.
But again, with the neural
state machine model,
I think we're getting
additional power by having
the state machine that we
can simulate reasoning on.
I have one more example of that,
but time is going by.
So maybe I should skip that.
>> You have five minutes.
>> I don't have many slides left.
>> Go forward.
>> So I could do this
example then. Okay, I will.
So we could do the same kind of
thing with GQA: since we
do have scene graphs and functional
programs behind all of GQA,
we can also do
the same trick that they did
of changing distributions.
So we can have
differences in structure
between training and testing.
So we can have
the training only using
some linguistic constructs,
say a notion of
covering or what
something is made of,
and at testing doing
different kinds of questions,
or we can have different content.
So at training, we don't have
any questions that refer to
types of foods or animals.
At testing time, we do have
questions of that sort.
The fine point here is, I mean,
that we don't
have completely novel objects;
it's not that we're asking at
test time, what's this animal,
and it's a red panda, and
I never saw a red panda
at training time.
So for objects in isolation,
we've got our trained object
detector, which has seen all of
the objects that map onto
our concept vocabulary
in the training data,
but that's only
learning in isolation.
So this is saying "Can it
compose together
that knowledge to be able to
answer new kinds of
questions that it hadn't been
asked about at training time?"
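A sketch of this kind of content-based split, where questions mentioning held-out concept categories go only into the test set; the categories and questions below are invented examples, not the actual GQA split.

```python
# Toy content-based train/test split: questions touching held-out
# categories (here, foods and animals) appear only at test time.

held_out = {"apple", "banana", "cow", "dog"}  # foods and animals

questions = [
    # (question, concepts it mentions)
    ("What color is the car?", {"car"}),
    ("Is the cow kneeling on the grass?", {"cow", "grass"}),
    ("What is the apple inside of?", {"apple", "bowl"}),
    ("Is there a helmet in the picture?", {"helmet"}),
]

train = [q for q, concepts in questions if not (concepts & held_out)]
test = [q for q, concepts in questions if concepts & held_out]

print(len(train), len(test))  # 2 2
```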
So here are our results here.
Again, the MAC network, we'll argue,
is a bit along the right lines,
but the neural state
machine is much more
effectively
able to generalize to
these new types of
questions, where some of
the more standard VQA models,
it's not that they don't
work at all, but they perform
fairly poorly at generalizing to
new types of things like this.
So in conclusion, yes.
So VQA as a problem
has always been considered
to be about language,
but I think it might be
interesting to explore whether
you can think in terms
of a conceptual space
or a language of thought, where you
use a common representation for
different modalities to reason
over, and that might actually
give additional power,
and I've outlined
a model that did that.
I mean, in general,
I think there's
this exciting question of,
can we take our neural
network models that have been
so successful for
sensory perception tasks,
and work out whether we can also
use them for thinking-slow tasks
involving understanding and
multi-step compositional reasoning?
So the overall goal is,
can we build neural
networks that think?
I think there's actually
a reasonable hope that we
might be able to do that.
I've suggested that
at least one promising direction
has been to do this by
trying to build models which use
attention over abstracted
disentangled concepts,
and then do multi-step
reasoning by having
an iterative attention process
over different time steps. Thank you.
>> Thanks, Chris, for a fabulous
and thought-provoking talk.
Why don't we spend the next 10 or
15 minutes answering questions,
and then give you
a little bit of a break before
you head off to your next.
>> Sure.
>> You can pick.
>> Okay.
>> In all these
compositional reasoning datasets,
one thing that's clearly
missing is hierarchy.
So initially, the picture of
all these people, like tall woman,
short boy, and man.
The first thing that came out to me was, "Wait,
it's a family," but family was
never mentioned in the scene graph.
So there's a hierarchy of
composition, where
the four people, at the next step,
actually constitute a family.
So do you think we should also be
pursuing these sorts of hierarchies?
>> Yes. So I think that's
a completely valid observation.
>> Can you repeat the question?
>> Sorry. The question was
these models are doing
some multi-step reasoning.
Yes. That they're not representing
hierarchy or perhaps not
even really representing
compositionality,
and it seems like they should.
I mean, that actually goes back to
this early picture I
showed right there.
But I think there's no doubt
at all that there's lots of
need to represent composition
hierarchy because a lot of the time,
yes, we're thinking
in terms of these.
Hierarchical models where we're
seeing bigger composite wholes,
and I agree we haven't
really been addressing that,
but I think we should.
We did in this old work in 2011,
and I'll get back to it one of
these days. Right there.
>> Yeah. So it's a question
about the testing methodology.
So at the end, you showed that when
the distribution on the training
and the testing was different,
you get worse results.
Is there any effort to
systematically test
every possible object in the scene,
or it seems like,
since you're generating these queries
and the scenes automatically,
you could just
systematically generate
all the possible inputs, right?
>> So yeah. I mean,
that's not something I
talked about, but I mean,
that's actually one of
the things that we aim at.
So that was down here,
perhaps not explained very
well, on that last point.
I mean, that was something
that we saw as one of
the big advantages of having
the controlled questions of
GQA that we can generate lots of
questions which are effectively
answering the same or related things.
So we can ask,
what color is the go-kart?
Is the go-kart red?
Is the go-kart green?
We could assess automatically
whether a system is giving
consistent answers,
and it turns out that
a lot of the time,
GQA systems don't give
consistent answers,
and we could do that for
relational ones as well.
We could say, what is to
the left of the go-kart?
Maybe there's a person;
then we can say,
what is to the right of
the person, and ask
backwards, and so on.
So we better assess actual understanding
rather than random guessing.
So we actually have
some metrics that we defined
that look at consistency of
answering and have gotten
some results on that.
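A sketch of such a consistency check over related questions; the helper and the question set are invented for illustration, not the actual GQA metrics.

```python
# Toy consistency check: given a model's answer to "what color is the
# go-kart?" and its yes/no answers to "is the go-kart <color>?", verify
# that the yes/no answers agree with the color the model itself gave.

def consistent(color_answer, yes_no_answers):
    """yes_no_answers maps each probed color to the model's yes/no reply;
    consistency requires "yes" exactly for the model's own color answer."""
    return all(
        (ans == "yes") == (color == color_answer)
        for color, ans in yes_no_answers.items()
    )

# Model says the go-kart is red, but also says yes to green: inconsistent.
print(consistent("red", {"red": "yes", "green": "yes"}))  # False
print(consistent("red", {"red": "yes", "green": "no"}))   # True
```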
>> Thanks.
>> Yeah.
>> So I have a question about
the neural state machine,
and currently I think
you have 1,700 objects,
and for object recognition
over such a large vocabulary
of objects,
I think performance
should be very bad.
So I think the accuracy of
object detection is even
lower than
the final VQA answer accuracy.
So how can you really get
such a high accuracy to
answer the questions,
but based on a very poor performance
on object detection?
>> So I think maybe there
are two answers to that.
I mean, I think
one answer to that is,
it turns out that this is
a data-distribution question.
It turns out that a lot
of the stuff that's
in the scene graphs, which
ultimately come from Visual Genome,
though we've normalized them,
is fairly simple stuff, right?
That there's a lot of
things that's talking
about people and fruits
and cars and airplanes.
So even though we do some things
to rebalance the data,
there are core concepts,
basic categories
if you will, in
psychological terms, that are
used a lot and are much fewer
in number, so that helps.
I think the other part of the answer
is we don't have to get
the answer exactly right
because each point we're putting
an attention distribution
across the space of concepts.
So to the extent that
the model roughly thinks that,
"It was probably one of these four
things and I don't know which."
The model can place
non-trivial attention over all of
those four things and
reason forward from there,
even though, if it was just
guessing at random between them,
then it only be getting
25 percent accuracy.
>> But there could be compounding errors,
because if the object detection is wrong,
then the reasoning model
runs on this wrong scene,
so it does something random,
and in fact, the final answer is correct,
just like guessing
and then getting it right,
so we do not really learn anything.
>> So there can be a chance
of getting things wrong,
but there's also a chance
that the multi-step
reasoning could actually
be more human-like because if
you say a go-kart to
the right of a person,
well even if you're go-kart
detector isn't that good,
if you know it's to
the right of a person, well,
you can say, "Oh, that thing over
there must be a go-kart," right?
So the actual relational reasoning
can actually help you do better.
The one person behind there.
>> So it seems that there are
two main approaches to
multi-step reasoning.
One is explicit modular instructions
versus a more universal,
continuous approach,
and it seems that you have tried both
of them in
two of your works.
One is the MAC cell, which
is a universal cell,
as opposed to your recent work,
which is more like
discrete instructions.
So from your experience,
which one of these
approaches you would
think would be more
fruitful to pursue?
>> Yeah. So I guess,
right at this moment,
I'll say the neural
state machine because,
look, it seems to be working there.
But I think, there's clearly
a wide open space of approaches,
then it's not really clear
that I or anyone knows
the best approach, but you're right.
So in some sense,
the neural state machine
approach is more
specific with these different things
compared to the MAC architecture.
But it's still not as specific and
custom as something like
the neural modular networks, right?
It's still the case
that there aren't
special units to do anything like
spatial reasoning or counting
or anything like that, right?
That it's still doing a more general
attention-based computation.
So in some sense,
I still see it as similar to
the MAC family. Yes, you.
>> I have two questions. One is,
the results that you present,
are not based on the
ground truth [inaudible],
so do you guys have the results
for the ground truth [inaudible],
no object detection,
no relation detection,
just the ground truth [inaudible],
and using your architecture
on top of that?
>> So the question is,
you're using a vision system, and
there's an answer that you should
be able to get where you just say,
here's the gold scene graph
for this scene,
you learn on that,
and then really, given
that state, you
should be able to get
a 100 percent on the questions
with the gold scene graph.
No, I don't have those answers,
and you know what,
that would be
an interesting thing to do.
I mean, I think clearly it'll
be a much higher number, right?
Because here there's two parts of
this question: one of which is
what is the difficulty of the vision
problem of recognizing things,
and the other one is how successfully
you actually reason across
these scene graphs?
>> I'm going to rule out questions
from the NLP team at MSR AI,
and the Deep Learning team
at MSR AI, because
you're going to meet with
Chris over the next few hours.
So any non-MSR-AI [inaudible].
Lots of them raising hands.
>> I have a more
high-level questions.
So if we combine
vision and NLP,
it's VQA or TQA, and
my background is ASR,
so my question is really how
can ASR be combined with,
or do you think one day we will be
working on something like combining
all the three [inaudible] ;
ASR, CV, and NLP?
>> So, yeah, I haven't done that.
So I mean, I have nothing against
speech recognition, of course.
In some sense, speech
recognition hasn't seemed
the right place to play for
trying to do this
high-level reasoning.
I mean, on the one hand,
there's clearly a connection, right?
When humans are doing
speech recognition,
they have a much
higher level understanding
of where they are and
what's being talked about,
and they use that to help with
their speech recognition,
and you should be
able to do the same.
So yes, we should be
able to combine all of
these modalities together
and do much more.
I mean, in practice
that's tended not to be what
happens in speech systems,
and the speech is recognized to
the word level before
anything else happens,
and to the extent that that's true,
if you're more interested in
higher-level reasoning,
incorporating speech hasn't
seemed the easy way to
make the problem more
interesting, whereas
doing multi-modal work
with vision has seemed
much more approachable.
Anyone on this side
have a question? Yes.
>> So you talked about how
the CLEVR dataset was basically
a very large amount of data,
but a very small space.
So you basically made a bigger
version of that in the form of GQA.
So what's the intuition on
how complex you want to
make these datasets so that
these high capacity models don't
actually memorize it versus
actually learn to reason on them.
>> Yeah, I mean,
I can say my thoughts,
but they're the usual
half-informed thoughts.
Yes, so I mean,
I guess in research
somehow you want to find
the right sweet spot, where
things aren't too difficult,
but they present the right kind
of challenges and
questions for what
you want to pursue.
I guess my feeling coming into this
was that if you compared textual
question-answering versus
visual question answering that
even though they'd been
progress every year on
visual question answering,
there's a way in which the problem
seemed too hard and too diverse,
so that whereas for textual
question-answering,
there's been a really
good ramp, in fact,
in some sense that
might seem like it's
almost too easy with
all of these results on
textual question-answering
claiming better-than-human
results on SQuAD
and other datasets.
For visual question answering
it seemed like
the space of questions,
and knowledge, and so on was too
variegated and hard for
the current systems.
But on the other hand,
the blocks world system seemed like
they were too small and too easy,
and so we were looking for
something that was in the middle,
which presented reasoning challenges
over real vision problems,
but is still more
constrained than just
saying Turkers come up
with some question to ask.
At that point, it's effectively
a bet that this might be
an interesting level of
difficulty for people to explore,
for working on scene understanding,
and it can certainly
be criticized because
our questions are still artificial
when it comes down to it.
We hope that they're from a broad
enough space that it's actually
an interesting good challenge for
developing visual scene
understanding,
reasoning systems for a few years.
>> One last question.
>> Okay. [inaudible]. Yeah.
>> So in your opinion,
what do you think is the role of
logical inferences in
these recent exercises?
At least it seems to me that
it might make sense for
these models to know that if
something is black, it cannot be red.
If A is to the left of B and
C is to the left of whatever,
[inaudible] and so on,
do you think there's a role for
logical inferences in this?
>> So certainly, there
is a role for having
more understanding of domains,
and being able to make use
of that understanding.
So I mean, there's
perhaps slightly more of
that than I fully made clear, right?
So by construction,
so when I said there
was a space of concepts
that we put into taxonomy,
so we actually are
exploiting that taxonomy,
and this is how we
have a somewhat hand-done,
disentangled representation.
So that we have taxonomic
concepts like colors,
and so then for
the property of color,
you'll have an attention distribution
over different colors.
So in a soft way,
it is representing that the choices of
color are complementary
to each other.
So we are making some use of
that kind of information,
but there's certainly a lot of
other reasoning information,
like "to the left
of" and "to the right of" being
opposites, that we aren't capturing.
So yeah, I think there is
absolutely a role to be doing
more inference of a logical
character in future models.
>> Okay. With that, let's thank
Chris again for his [inaudible]
