The following content is
provided under a Creative
Commons license.
Your support will help
MIT OpenCourseWare
continue to offer high quality
educational resources for free.
To make a donation or
view additional materials
from hundreds of MIT courses,
visit MIT OpenCourseWare
at ocw.mit.edu.
JOSH TENENBAUM: So where we
left off was, you know, again,
I was telling you a story, both
conceptual and motivational
and a little bit technical
about how we got to the things
we're trying to do now
as part of the center.
And it involves, again both
the problems we want to solve.
We want to understand
what is this common sense
knowledge about
the physical world
and the psychological world that
you can see in some form even
in young infants?
And what are the
learning mechanisms
that build it and grow it?
And then what's the kind
of technical ideas that
are going to be also
hopefully useful for building
intelligent robots or
other AI systems that
can explain on the scientific
side how this stuff works?
All right, so that
was all this business.
And what I was suggesting, or
I'll start to suggest here now,
is that it goes
back to that quote
I gave at the beginning
from Craik, right?
The guy who in 1943
wrote this book
called The Nature
of Explanation.
And he was saying that the essence of intelligence is this ability to build models that allow you to explain the world, to then reason, simulate, plan, and so on.
And I think we need
tools for understanding
how the brain is a modeling
engine or an explaining engine.
Or to get a little bit
recursive about it,
since what we're
doing in science
is also an explanatory
activity, we
need modeling
engines in which we
can build models of the
brain as a modeling engine.
And that's where the
probabilistic programs
are going to come in.
So that's part of why I spent a while in the morning talking about these graphical models and the ways that we tried and, I think, made progress on, but ultimately were dissatisfied with, modeling various aspects of cognition with these kinds of graphical models.
I put up-- I didn't say too much
about the technical details.
That's fine.
You can read a lot
about it or not.
But these ways of using
graphs, mostly directed graphs,
to capture something about
the structure of the world.
And then you put probabilities
on it in some way,
like a diffusion process
or a noisy transmission
process for a food web.
That's a style of
reasoning that sometimes
goes by the name of
Bayesian networks
or causal graphical models.
It's been hugely influential
in computer science,
many different fields, not just
AI, and many fields outside
of computer science.
Not just cognitive science,
neuroscience-- many areas
of science and engineering.
Here are just a few examples
of some Bayesian networks
you get if you search on Google
image for Bayesian networks.
And if you look carefully, you'll see they come from biology, economics, chemical engineering, whatever.
They're due to many people.
Maybe more than
anyone, the person
who's most associated with
this idea and with the name
Bayesian networks
is Judea Pearl.
He received the Turing Award, which is like the highest award in computer science.
This is a language
that we were using
in all the projects you saw
up until now in some form
that we and many others used.
Because they provide a powerful set of general purpose tools.
It goes back to this dream
of building general purpose
systems for
understanding the world.
So these provide general
purpose languages
for representing
causal structure--
I'll say a little
bit more about that--
and general purpose
algorithms for doing
the probabilistic
inference on this.
So we talked about
ways of combining
sophisticated
statistical inference
with knowledge
representation that's
causal and compositional.
These models-- I'll just
tell you a little bit
about the one in the
upper left up there,
that thing that says
diseases and symptoms.
It is causal.
It is compositional.
It does support
probabilistic inference.
And it was the
heart of why we were
doing what we were
doing and showing you
how different kinds of
causal graphical models
basically could capture
different modes of people's
reasoning.
And the idea that maybe
learning about different domains
was learning those different
kinds of graphs structures.
So let me say a little
bit about how it works
and then why it's not enough,
because it really isn't enough.
I mean, it's the right start.
It's definitely in
the right direction.
But we need to go beyond it.
That's where the probabilistic programs come in.
So look at that network up
there on the upper left.
It's one of the most
famous Bayesian networks.
It's a textbook example.
One of the first actually implemented AI systems was based on this: a system for medical diagnosis.
Sort of a simple approximation
to what a general practitioner
might be doing if a patient
comes in and reports
some pattern of symptoms, and
they want to figure out what's
wrong.
So diagnosis of a disease
to explain the symptoms.
The graph is a bipartite graph.
So two sets of nodes
with the arrows,
again, going down in
the causal direction.
The bottom layer, the
symptoms, are the things
that you can nominally observe.
A patient comes in
reporting some symptoms.
Not all are observed,
but others maybe
are things that you could test,
like medical test results.
And then the top
level is this level
of latent structure, the
causes, the things that
cause the symptoms.
The arrows represent
basically which
diseases cause which symptoms.
In this model there's roughly
500, 600 diseases-- you know,
the commonish ones,
not all that common--
and 4,000 symptoms.
So it's a big model.
And in some sense,
you can think of it
as a big probability model.
It's a way of specifying a
joint distribution on this
4,600-dimensional space.
But it's a very particular one
that's causally structured.
It represents only the minimal
causal dependencies and really
only the minimal
probabilistic dependencies.
That sparsity is really
important for how you use it,
whether you're talking
about inference or learning.
So inference means observing
patterns of symptoms
or just observing the values
of some of those variables
and making guesses
about the others.
Like observing some
symptoms and making
guesses about the
diseases that are most
likely to have explained those.
Or you might make a prediction
about other symptoms
you could observe.
So you could go up
and then back down.
You could say, well,
from these symptoms,
I think the patient might have
one of these two rare diseases.
I don't know which one.
But if it was this
disease, then it
would predict that symptom
or that test maybe.
But this disease
wouldn't, so then that
suggests a way to plan
an action you could
take to figure things out.
So then I could go
test for that symptom,
and that would tell me which of
these diseases the patient has.
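The flavor of those inferences can be sketched in miniature. Below is a hypothetical two-disease, three-symptom noisy-OR network (all the diseases, symptoms, and numbers are invented for illustration), with the posterior over diseases computed by brute-force enumeration. The real medical networks work the same way, just at a much larger scale, where exact enumeration is no longer feasible.

```python
import itertools

# A tiny hypothetical noisy-OR network: 2 diseases, 3 symptoms.
# leak = probability a symptom appears with no disease present;
# weights[d][s] = probability that disease d alone turns on symptom s.
priors = {"flu": 0.10, "tumor": 0.001}
leak = 0.01
weights = {"flu":   {"headache": 0.8, "fever": 0.9, "blurred_vision": 0.05},
           "tumor": {"headache": 0.7, "fever": 0.1, "blurred_vision": 0.6}}

def p_symptom(sym, active):
    """Noisy-OR: the symptom stays off only if every active cause fails."""
    p_off = 1 - leak
    for d in active:
        p_off *= 1 - weights[d][sym]
    return 1 - p_off

def posterior(observed):
    """Enumerate all disease combinations; observed maps symptom -> True/False."""
    scores = {}
    for combo in itertools.product([False, True], repeat=len(priors)):
        active = [d for d, on in zip(priors, combo) if on]
        p = 1.0
        for d, on in zip(priors, combo):
            p *= priors[d] if on else (1 - priors[d])
        for sym, val in observed.items():
            ps = p_symptom(sym, active)
            p *= ps if val else (1 - ps)
        scores[tuple(active)] = p
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

post = posterior({"headache": True, "fever": False})
# Headache without fever shifts belief away from flu, which predicts fever.
```

Because the graph is sparse and bipartite, each likelihood term only involves the few diseases that actually point at that symptom; that sparsity is what makes inference and learning tractable in larger versions of this model.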
They're also useful in planning
other kinds of treatments,
interventions.
Like if you want
to cure someone--
again, we all know
this intuitively--
you should try to cure the
disease, not the symptom.
If you have some
way to act to change
the state of one of
those disease variables
to kind of turn it
off, reasonably that
should relieve the symptoms.
If that disease gets turned off,
these symptoms should turn off.
Whereas just
treating the symptom
like taking Advil for
a headache is fine
if that's all the problem is.
But if it's being caused
by something, you know,
god forbid, like a brain
tumor, it's not going to help.
It's not going to cure the
problem in the long term.
OK, so all those patterns of causal inference, reasoning, prediction, action planning, exploration-- this is a beautiful language for capturing all of them. You can automate all those inferences.
Why isn't it enough, then, for
capturing commonsense reasoning
or this approach to cognition?
Which I'm calling the kind
of model-building explaining
part, as opposed to the
pattern recognition part.
I mean, again, I don't
want to get too far
behind in talking about this.
But that example is so rich.
Like, you could build a neural network: just turn the arrows around to learn a mapping from symptoms to diseases, and that would be a pattern classifier.
So these two different
paradigms for intelligence--
as some of the questions we're
getting at, and as I will show
versions of that with some
more interesting examples
in a little bit--
often it's very subtle, and
the relations between them
are quite valuable.
So one way to work with
such a model, for example,
or one nice way--
I mean, I mentioned a lot
of people want to know--
and I'll keep talking about
this for the rest of the hour--
productive ways to combine
these powerful generative
models with more pattern
recognition approaches.
For some classes of this model--
there are always general
purpose algorithms
that can support
these inferences, that
can tell you what diseases
you're likely to have given
what symptoms.
But in some cases, they
could be very fast.
In other cases, they
could be very slow.
Whereas you could imagine trying to learn a neural network that looks just like that, only with the arrows going up, so it implements a mapping from data to diseases; that could help do much faster inference in the cases where that's possible.
So that's just one example, and it might not be a crazy way to think about, for example, more generally how top-down and bottom-up connections work in the brain.
I'll take that a little bit more
literally in a vision example
in a second.
So there's a lot you
can get from studying
these causal graphical models,
including some version of what
it is for the mind
to explain the world
and how that explanation
and pattern recognition
approach can work together.
But it's not enough to really get at the heart of common sense. The mental generative models we build are more richly structured.
They're more like programs.
What do I mean by that?
Well, here I'm giving a bunch of
examples of scientific theories
or models.
Not commonsense ones, but I
think the same idea applies.
Ways of, again,
explaining the world,
not just describing the pattern.
So we went at the beginning
through Newton's laws
versus Kepler's laws.
That's just one example.
And you might not have thought
of those laws as a program,
but they're certainly
not a graph.
On the first slide when
I showed Newton's laws,
there was a bunch of
symbols, statements
in English, some math.
But what it comes
down to is basically
a set of pieces of
code that you could
run to generate the orbits.
It doesn't describe the shapes or the velocities, but it's a machine into which you plug some things.
You plug in some
masses, some objects,
some initial conditions.
And you press run, and it
generates the orbits, just
like what you're seeing there.
Although those probably weren't generated that way. That's a GIF.
OK.
That's more like
Kepler or Ptolemy.
But anyway, it's a
powerful machine.
It's a machine,
which if you put down
the right masses in
the right position,
they don't just all
go around in ellipses.
Some of them are
like moons, and they
will go around the things that
will go around the others.
And some of them will be
like apples on the Earth,
and they won't go
around anything.
They'll just fall down.
So that's the powerful machine.
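That "plug in masses and initial conditions, press run, and it generates the orbits" idea can be sketched directly. Here is a minimal sketch, in toy units with G = 1 and made-up masses and initial conditions, of stepping Newton's equations forward in time rather than solving them analytically:

```python
import math

G = 1.0  # toy units; the masses and positions below are invented for illustration

def step(bodies, dt):
    """Symplectic Euler: update velocities from pairwise gravity, then positions."""
    for i, a in enumerate(bodies):
        ax = ay = 0.0
        for j, b in enumerate(bodies):
            if i == j:
                continue
            dx, dy = b["x"] - a["x"], b["y"] - a["y"]
            r3 = (dx * dx + dy * dy) ** 1.5
            ax += G * b["m"] * dx / r3
            ay += G * b["m"] * dy / r3
        a["vx"] += ax * dt
        a["vy"] += ay * dt
    for a in bodies:
        a["x"] += a["vx"] * dt
        a["y"] += a["vy"] * dt

# Plug in some masses, some objects, some initial conditions, and press run:
sun    = {"m": 1.0,   "x": 0.0, "y": 0.0, "vx": 0.0, "vy": 0.0}
planet = {"m": 0.001, "x": 1.0, "y": 0.0, "vx": 0.0, "vy": 1.0}  # ~circular speed
bodies = [sun, planet]
for _ in range(10000):
    step(bodies, 0.001)
r = math.hypot(planet["x"] - sun["x"], planet["y"] - sun["y"])
# r stays near 1: the planet orbits rather than falling in or flying away
```

Give the small body less than circular speed and it falls like the apple; add a third comparable mass and the trajectories have no analytic solution, so running the simulation is all you can do.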
And in the really simplest
cases, that machine--
those equations can be
solved analytically.
You can use calculus or
other methods of analysis
like Newton did.
He didn't have a computer.
And you can show that for a
two-body system, one planet
and one sun, you can solve
those equations to show
that you get Kepler's law.
Amazing.
And under the approximation
that only the sun is--
for every other
planet, it's only
the sun that's exerting
a significant influence,
you can describe all of
Kepler's laws this way.
But once you have more
than two bodies interacting
in some complex way,
like three masses similar
in size near each
other, you can't
solve the equations
analytically anymore.
You basically just have
to run a simulation.
For the most part, the
world is complicated,
and our models have to be run.
Here's a model of a
riverbed formation.
Or these are snapshots of a
model of galaxy collision,
you know, and climate
modeling or aerodynamics.
So basically, what most modern science is, is this: you write down descriptions
of the causal processes,
something going on
in the world, and you study
that through some combination
of analysis and simulation
to see what would happen.
If you want to
estimate parameters,
you try out some guesses
of the parameters.
And you run this thing, and
you see if its behavior looks
like the data you observe.
If you are trying to decide
between two different models,
you simulate each
of them, and you
see which one looks more
like the data you observe.
If you think there's something wrong with your model-- it doesn't quite look like the data you observe-- you think: how could I change my model so that if I run it, it'll look more like the data I observe in some important way?
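That simulate-and-compare loop can be sketched concretely. Here, a made-up causal process (exponential decay with an unknown rate) stands in for a real scientific model: try out guesses of the parameter, run each one, and keep the guess whose simulated behavior looks most like the observed data.

```python
import math, random

random.seed(0)

def simulate(k, times, y0=10.0):
    """Run the causal model forward: hypothetical exponential decay with rate k."""
    return [y0 * math.exp(-k * t) for t in times]

times = [0, 1, 2, 3, 4, 5]
true_k = 0.5
data = [y + random.gauss(0, 0.1) for y in simulate(true_k, times)]  # noisy observations

def distance(a, b):
    """Squared mismatch between a simulated run and the data."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Try out some guesses of the parameter, run the model, compare to the data:
guesses = [0.1, 0.3, 0.5, 0.7, 0.9]
best_k = min(guesses, key=lambda k: distance(simulate(k, times), data))
# best_k is the guess whose simulated behavior looks most like the data
```

Comparing two different models works the same way: simulate each one and see which run looks more like what you observed.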
Those activities of science--
those are, in some
form I'm arguing,
the activities of common
sense explanation.
So when I'm talking about
the child as scientist,
that's what I'm
basically talking about.
It's some version of that.
And so that includes
both using--
describing the causal processes
with a program that you run.
Or if you want to
talk about learning,
the scientific analog is
building one of these theories.
You don't build a
theory, whether it's
Newton's laws or Mendel's
laws or any of these things,
by just finding
patterns and data.
You do something like
this program thing,
but kind of recursively.
Think of you having some kind
of paradigm, some program that
generates programs,
and you use it
to try to somehow search
the space of programs
to come up with a program
that fits your data well.
OK, so that's, again,
kind of the big picture.
And now, let's talk
about how we can actually
do something with this
idea-- use these programs.
And you might be wondering,
OK, maybe I understand--
I'm realizing I didn't
say the main thing I
want you to understand.
The main thing I want
you to get from this
is how programs
go beyond graphs.
So none of these processes here can be nicely described with a graph in the way we have in the language of graphical models.
So the interesting causality--
I mean, in some sense,
there's kind of a graph.
You can talk about the state
of the world at time T,
and I'll show you graphs
like this in a second.
The state of the world at
time T plus 1 and an arrow
forward in time.
But all the interesting stuff
that science really gains power
from are the much more
fine-grained structure
captured in equations
or functions
that describe exactly
how all this stuff works.
And it needs languages like math or C++ or LISP. It needs a symbolic language of processes to really do it justice.
The second thing I want you to get will take a minute, but let's put it out there. Yes, OK, maybe you get the idea that programs can be used to describe causal processes in interesting ways. But where does the probability part come in?
So the same thing is actually
true in graphical models.
How many people have
read Judea Pearl's 2000
book called Causality?
How many people have
read his '88 book?
Or nobody's read anything.
But, OK, so what Pearl
is most famous for--
I mean, when we say Pearl's
famous for inventing
Bayesian networks,
that's based on work
he did in the '80s,
in which, yes,
they were all
probability models.
But then he came
to what he calls,
and I would call, too, a deeper
view in which it was really
about basically deterministic
causal relations.
Basically it was a graphical
language for equations--
certain classes of equations
like structural equations.
If you know about linear
structural equations,
it was sort of like nonlinear
structural equations.
And then probabilities are these things you put just on top of it to capture the things you don't know, that you're uncertain about.
And I think he was getting at the fact that to scientists, and also to people--
there's some very nice work by Laura Schulz and Jessica Sommerville, both of whom will be here next week actually,
on how children's
concepts of causality
are basically
deterministic at the core.
And where the
probabilities come in
is on the things
that we don't observe
or the things we don't
know, the uncertainty.
It's not that the
world is noisy.
It's that we believe, at least--
except for quantum mechanics--
but our intuitive notions
are that the world is
basically deterministic,
but with a lot of
stuff we don't know.
This was, for example, Laplace's
view in philosophy of science.
And really until
quantum mechanics,
it was broadly the
Enlightenment science view
that the world is full of all these complicated deterministic machines, and uncertainty comes from the things that we can't observe or can't measure finely enough, or that are just in some form unknown or unknowable to us.
Does that make sense?
So you'll see more
of this in a second.
But where the
probabilities are going
to come from is basically if
there are inputs to the program
that we don't know
or parameters we
don't know that in
order to simulate them
we're going to have to
put distributions on those
and make some guesses
and then see what
happens for different guesses.
Does that make sense?
OK.
Good.
So again, that's most of the
technical stuff I need to say.
And you'll learn about how
this works in much more
concrete details if
you go to the tutorial
afterwards that Tomer
is going to run.
What you'll see there is this.
So here are just a few examples.
Many of you hopefully already
looked at the web pages
from this probmods.org thing.
And what you see here is
basically each of these boxes
is a probabilistic
program model.
Most of it is a bunch of define statements. So if you look here, you'll see these define statements. Those are just defining functions.
They name the function.
They take some inputs,
which call other functions,
and then they maybe
do something--
they have some output
that might be an object.
It might itself be a function.
These can be functions that
generate other functions.
And where the probabilities come in is that sometimes these functions call random number generators, basically.
If you look carefully,
you'll see things
like Dirichlet, or uniform
draw, or Gaussian, or flip.
Right, those are primitive random functions that flip a coin, or roll a die, or draw from a Gaussian. And those capture things that are currently unknown.
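In Python rather than Church, that style might be sketched like this. Here `flip`, `gaussian`, and `uniform_draw` are stand-ins for Church's primitive random functions, and the little scene model is entirely made up for illustration:

```python
import random

random.seed(1)

# Python stand-ins for Church-style primitive random functions:
def flip(p=0.5):
    return random.random() < p

def gaussian(mu, sigma):
    return random.gauss(mu, sigma)

def uniform_draw(items):
    return random.choice(items)

# A generative model is just functions calling functions, with randomness mixed in.
# Hypothetical example: an object's height depends on its latent class.
def sample_class():
    return uniform_draw(["block", "ball"])

def sample_height(cls):
    # blocks sit low, balls can bounce higher (made-up numbers)
    return gaussian(1.0 if cls == "block" else 3.0, 0.5)

def sample_scene():
    cls = sample_class()
    return {"class": cls, "height": sample_height(cls), "wobbly": flip(0.2)}

scene = sample_scene()  # every run produces a different possible world
```

Each defined function plays the role of a node, and the random primitives it calls are where the uncertainty about unknown quantities enters.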
In a very important sense, the
particular language, Church,
that you're going to learn here
with its sort of stochastic
LISP--
basically just functions
that call other functions
and maybe add in some
randomness to that--
is very much analogous to the
directed graph of a Bayesian
network.
In a Bayesian network,
you have nodes and arrows.
And the parents of a node, the
ones that send arrows to it,
are basically the
minimal set of variables
that if you were going
to sample from this model
you'd have to sample first in
order to then sample the child
variable.
Because those are the
key things it depends on.
And you can have a
multi-layered Bayesian network
that, if you are going
to sample from it,
it's just you start at the
top and you sort of go down.
That's exactly the same thing you have in these probabilistic programs, where the define statements are basically defining a function. And the functions are the nodes, and the other functions that they call as part of the statement are the nodes that send arrows there.
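That start-at-the-top-and-go-down sampling scheme (ancestral sampling) can be sketched as follows, with a hypothetical three-node rain/sprinkler/wet-grass network; each node is sampled only after all of its parents have been:

```python
import random

random.seed(2)

# A tiny hypothetical DAG: rain and sprinkler both cause wet grass,
# and rain suppresses the sprinkler. All probabilities are made up.
parents = {"rain": [], "sprinkler": ["rain"], "wet": ["rain", "sprinkler"]}
samplers = {
    "rain":      lambda v: random.random() < 0.3,
    "sprinkler": lambda v: random.random() < (0.05 if v["rain"] else 0.5),
    "wet":       lambda v: random.random() < (0.95 if (v["rain"] or v["sprinkler"]) else 0.02),
}

def ancestral_sample():
    """Sample nodes in an order where every parent precedes its children."""
    values, pending = {}, list(parents)
    while pending:
        for node in pending:
            if all(p in values for p in parents[node]):
                values[node] = samplers[node](values)  # parents already sampled
                pending.remove(node)
                break
    return values

world = ancestral_sample()  # one joint sample from the network
```

Each node's sampler receives exactly the already-sampled parent values it depends on, which is the minimal-dependency structure the arrows encode.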
But the key is, as you can imagine if you've ever-- I mean, all of you have written computer programs-- that only very simple programs look like directed acyclic graphs. And that's what a Bayesian network is. It's very easy and often necessary to write a program to really capture something causally interesting in the world where it's not a directed acyclic graph.
There's all sorts of cycles.
There's recursion.
One thing that a function can
do is make a whole other graph.
Or often it might be
directed and acyclic,
but all the interesting
stuff is kind
of going on inside what happens
when you evaluate one function.
So if you were to
draw it as a graph,
it might look like you could
draw a directed acyclic graph,
but all the
interesting stuff will
be going on inside
one node or one arrow.
So let me get more specific
about the particular kind
of programs that we're
going to be talking about.
In a probabilistic
programming language
like Church, or in general
in this view of the mind,
we're interested in being
able to build really
any kind of thing.
Again, there's lots
of big dreams here.
Like I was saying before, I felt like we had to give up on some dreams, but we've replaced them with even grander ones, like probabilistic modeling engines that can do any computable model.
But in the spirit of trying to
scale up from something that we
can get traction on, what
I've been focusing on
in a lot of my work
recently and what we've
been doing as part
of the center,
are particular
probabilistic programs
that we think can capture this
very early core of common sense
intuitive physics and intuitive
psychology in young kids.
It's what I called-- and I
remember I mentioned this
in the first lecture--
this game engine in your head.
So it's programs for graphics
engines, physics engines,
planning engines, the
basic kinds of things
you might use to build one of
these immersive video games.
And we think if you wrap
those inside this framework
for probabilistic
inference, then
that's a powerful way to do the kind of common sense scene understanding, whether in these adult versions or in the young kid versions.
Now, to specify this
probabilistic programs
view, just like with
Bayesian networks
or these graphical models, we
wanted general purpose tools
for representing interesting
things in the world
and for computing the
inferences that we want.
Again, which means basically
observing, say, just like you
observe some of the
symptoms and you
want to compute the
likely diseases that best
explain the observed symptoms.
Here we talk about observing
the outputs of some
of these programs,
like the image
that's the output of
a graphics program.
And we want to work backwards
and make a guess at the world
state, the input to
the graphics engine
that's most likely to
have produced the image.
That's the analog of getting
diseases from symptoms.
Or again, that's our
explanation right there.
And there are lots of different
algorithms for doing this.
I'm not going to say
too much about them.
Tomer will say a little
bit more in the afternoon.
The main thing I
will do is, I will
say that the main
general purpose
algorithms for inference in
probabilistic programming
language are in the
category of slow
and slower and
really, really slow.
And this is one of the
many ways in which there's
no magic or no free lunch.
Across all of AI and
cognitive science,
when you build very
powerful representations,
doing inference with
them becomes very hard.
It's part of why
people often like
things like neural networks.
They're much weaker
representations,
but inference can
be much faster.
And at the moment, the only
totally general purpose
algorithms for doing inference
with probabilistic programs
are slow.
But first of all,
they're getting faster.
People are coming up with--
and I can talk about this
offline where that's going--
but also-- and this is what
I'll talk about in a sharper way
in a second--
there are particular classes
of probabilistic programs,
in particular, the ones in
the game engine in your head.
Like, for vision, it's inverse graphics, and maybe some things about physics and psychology too.
I mean, again, I'm just thinking
of the stuff of like what's
going on with it
when a kid is playing
with some objects
around them and thinking
about what other people might
think about those things.
It's in that setting where we think you can build, in some sense, special purpose-- I mean, they're still pretty general-- inference algorithms for doing inference in probabilistic programs, getting the causes from the effects, that are much, much faster than things that would work on just arbitrary probabilistic programs, and that actually often look a lot like neural networks.
And in particular,
we can directly
use, say for example, deep
convolutional neural networks
to build these
recognition programs
or basically inference
programs that
work by pattern recognition
in, for example,
an inverse graphics
approach to vision.
So that's what I'll
show you basically now.
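That amortized-recognition idea can be sketched in miniature: sample latents from the generative model, render them, and fit a fast bottom-up recognizer to the resulting (output, latent) pairs. Here a one-dimensional least-squares fit stands in for a deep convolutional network, and the "renderer" is a made-up one-pixel camera:

```python
import random

random.seed(3)

def render(x):
    """Toy deterministic 'graphics': a one-pixel brightness reading of position x."""
    return 2.0 * x + 1.0 + random.gauss(0, 0.01)  # slight sensor noise

# Amortization: sample latents from the prior, render them, and train a fast
# recognition model on the pairs (here, least squares instead of a deep net).
xs = [random.uniform(0, 10) for _ in range(500)]
fs = [render(x) for x in xs]

n = len(xs)
mf = sum(fs) / n
mx = sum(xs) / n
a = sum((f - mf) * (x - mx) for f, x in zip(fs, xs)) / sum((f - mf) ** 2 for f in fs)
b = mx - a * mf

def recognize(feature):
    """Fast bottom-up guess at the latent that produced this feature."""
    return a * feature + b

guess = recognize(render(4.0))  # should land close to the true latent, 4.0
```

The training data is free because the generative model can produce unlimited (world, image) pairs; the learned inverse mapping then does in one pass what a random search would take many iterations to do.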
I'm going to start
off by just working
through a couple
of these arrows.
I'm going to first talk about
this sort of approach we've
done to tackle both
vision as inverse graphics
and some intuitive
physics on the scene
recovered and then say a
little bit about the intuitive
psychology side.
Here's an example of the kind of
specific domain we've studied.
It's like our Atari setting.
It's a kind of video game
inspired by the real game
Jenga.
Jenga's this cool game you
play with wooden blocks.
You start off with a very, very,
very nicely stacked up thing
and you take turns
removing the blocks.
And the player who
removes the block that
makes the whole thing fall
over is the one who loses.
And it really exercises this part of your brain that we've been studying here, which is an ability to reason about stability and support. I very briefly went over this, but it's one of the classic case studies of infant object knowledge.
Looking at how basically
these concepts develop
in some really interesting ways
over the first year of life.
Though what we're doing here
is building models and testing
them primarily with adults.
It is part of what we're trying to do in our Brains, Minds, and Machines research program here, in collaboration with Liz and others, to actually test these ideas in experiments with infants.
But what I'll show you is
just kind of think of it
as like infant-inspired
adult intuitive physics
where we build and test the
models in an easier way,
and then we're taking it
down to kids going forward.
So the kind of
experiment we can do
with adults is show them
these configurations of blocks
and say, for example,
how stable under gravity
is one of these towers
or configurations?
So like everything
else, you can make
a judgment on a scale of
zero to 10 or one to seven.
And probably most
people would agree
that the ones in the upper left
are relatively stable, meaning
if you just sort of
run gravity on it
it's not going to fall over.
Whereas the ones
in the lower right
are much more likely
to fall under gravity.
Fair enough?
That's what people say.
OK.
So that's the kind of thing
we'd like to be able to explain
as well as many
other judgments you
could make about
this simple, but not
that simple world of objects.
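A crude sketch of a model behind such judgments: replace the physics engine with a simple center-of-mass support rule, add perceptual noise to the inferred block positions, and read off the fraction of sampled worlds that topple. All the geometry and noise levels here are invented for illustration:

```python
import random

random.seed(4)

def falls(tower):
    """Simplified statics check standing in for a physics engine: the stack
    above each block must keep its center of mass over that block's footprint."""
    # tower: list of (center_x, width), bottom block first; unit masses assumed
    for i in range(len(tower) - 1):
        above = tower[i + 1:]
        com = sum(cx for cx, _ in above) / len(above)
        cx, w = tower[i]
        if not (cx - w / 2 <= com <= cx + w / 2):
            return True
    return False

def instability(tower, noise=0.2, n=1000):
    """Model of a graded human judgment: fraction of noisy perceptual
    samples of the block positions that topple under 'gravity'."""
    count = 0
    for _ in range(n):
        jittered = [(cx + random.gauss(0, noise), w) for cx, w in tower]
        if falls(jittered):
            count += 1
    return count / n

straight = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]   # neatly stacked
leaning  = [(0.0, 1.0), (0.45, 1.0), (0.9, 1.0)]  # staircase, about to go
# instability(straight) is low; instability(leaning) is much higher
```

The graded zero-to-ten judgments people give fall out naturally: a tower that only falls under some perceptual samples gets an intermediate score.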
And again, you can see
how in principle this
could very nicely interface with
what Demis was talking about.
He talked about their ambition to do the SHRDLU task, which was this ability to basically have a system that can take in instructions in language and manipulate objects in a blocks world.
They are very far from that.
Everybody's really far from
having a general purpose
system that can do that in
any way like a human does.
But we think we're building some
of the common sense knowledge
about the physical
world that would
be necessary to get something
like that to work or to explain
how kids play with blocks,
play with each other,
talk to each other while they're
playing with blocks and so on.
So the first step
is the vision part.
In this picture here, it's
that blue graphics arrow.
Here's another way into it.
We want to be able to take a
2D image and work backwards
to the world state,
the kind of world state
that can support
physical reasoning.
Again, remember
these buzzwords--
explaining the mind
with generative models
that are causal
and compositional.
We want a description
of the world which
supports causal
reasoning of the sort
that physics is doing,
like forces interacting
with each other.
So it's got to have things
that can exert force
and can suffer forces.
It's got to have
mass in some form.
It's got to be
compositional because you've
got to be able to pick up
a block and take it away.
Or if I have these blocks over
here and these blocks over here
and I want to put these
ones on top of there,
the world state has
to be able to support
any number of objects
in any configuration
and to literally
compose a representation
of a world of objects
that are composed together
to make bigger things.
So really the only
way we know how
to do that is something like
what's sometimes in engineering
called a CAD model or
computer-aided design.
But it's basically
a representation
of three-dimensional
objects, often
with something like
a mesh or a grid
of key points with their masses
and springs for stiffness,
something like that.
Here, my picture of the world state looks an awful lot like the image,
only it's in black and
white instead of color.
But the difference is that
the thing on the bottom
is actually an image.
Whereas the thing
on the top is just
a 2D projection of a 3D model.
I'll show you that one.
Here's a few others.
So I'll go back and
forth between these.
Notice how it kind of looks like
the blocks are moving around.
So what's actually
going on is these
are samples from the
Bayesian posterior
in an inverse graphics system.
We put a prior on
world states, which
is basically a prior on what we
think the world is made out of.
We think there's these
Jenga blocks basically.
And then the likelihood, which is that forward model, is the probability of seeing a particular 2D image given a 3D configuration of blocks. And going back to the point from before, it's basically deterministic with a little bit of noise.
It's deterministic.
It just follows the
rules of OpenGL graphics.
Basically says
objects have surfaces.
They're not transparent.
You can't see through them.
That's an extra complication
if you wanted to have that.
And basically the
image is formed
by taking the closest
surface of the closest object
and bouncing a ray of light off
of it, which really just means
taking its color and
scaling it by intensity.
It's a very simple shadow model.
So that's the causal model.
And then we can add a little
bit of uncertainty like,
for example, maybe we can't--
there's a little bit of
noise in the sensor data.
So you can be uncertain about
exactly the low level image
features.
And then when you run one of
these probabilistic programs
in reverse to make a guess of
what configuration of blocks
is most likely to have
produced that image,
there is a little bit
of posterior uncertainty
that inherits from the fact that
you can't perfectly localize
those objects in the world.
So again, what you see here
are three or four samples
from the posterior--
the distribution
over best guesses
of the world state
of 3D objects that were
most likely to have rendered
into that 2D image.
And any one of those is now
an actionable representation
for physical manipulation
or reasoning.
OK?
And how we actually
compute that,
again, I'm not going
to go into right now.
I'll go into something
like it in a minute.
But at least in its
most basic form,
it involves some
rather unfortunately
slow random search
process through the space
of blocks models.
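A toy version of that slow random search can be sketched as Metropolis-Hastings over a one-dimensional "block" position, with a made-up smooth renderer and a Gaussian pixel-noise likelihood:

```python
import math, random

random.seed(5)

def render(x, width=20):
    """Toy renderer: a 1D 'image' with a smooth bright bump where the block is."""
    return [math.exp(-((i - x) ** 2) / 4.0) for i in range(width)]

def log_likelihood(image, x, sigma=0.3):
    """Gaussian pixel noise between the observed image and the rendered hypothesis."""
    return sum(-(o - r) ** 2 / (2 * sigma ** 2) for o, r in zip(image, render(x)))

true_x = 12.0
observed = [p + random.gauss(0, 0.3) for p in render(true_x)]  # noisy sensor image

# Metropolis-Hastings: propose small moves of the hypothesized block position,
# preferentially keeping moves whose rendering better explains the image.
x = 5.0  # arbitrary initial guess
samples = []
for _ in range(5000):
    prop = x + random.gauss(0, 1.0)
    if 0 <= prop < 20:
        if math.log(random.random()) < log_likelihood(observed, prop) - log_likelihood(observed, x):
            x = prop
    samples.append(x)

estimate = sum(samples[2500:]) / 2500  # posterior mean from post-burn-in samples
# the estimate should land near true_x, with spread reflecting sensor noise
```

The lingering spread in the post-burn-in samples plays the role of the posterior uncertainty in the block demos: several nearby configurations render to nearly the same image, and the sampler visits all of them.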
Here's another example.
This is another
configuration there--
another image.
And here is a few samples
again from the posterior.
And hopefully when you see
these things moving around,
whether it's this one
or the one before,
you see them move a little
bit, but most of them
look very similar.
You'd be hard pressed
to tell the difference
if you looked away for a second
between any one of those.
Which one are you
actually seeing?
And that's exactly the point.
The uncertainty you see there
is meant to capture basically
the uncertainty you
have in a single glance
at an image like that.
You can't perfectly tell
where the blocks are.
So basically any one
of these configurations
up here is about equally good.
And we think your
intuitive physics,
your sort of common sense
core intuitive physics
that even babies have,
is operating over one
or a few samples like that.
Now, in separate work
that I don't think of
as really being
about common sense,
but it's one of the things
we've been doing in our group
and in CBMM where these
ideas best make contact
with the rest of what
people are doing here
and where we can really test
interesting neural hypotheses
potentially and
understand the interplay
between these generative
models for explanation
and the more sort of
neural-network-type models
for pattern recognition.
We've been really
pushing on this idea of
vision as inverse graphics.
So I'll tell you a little
bit about that because it's
quite interesting for CBMM.
But I want to make sure to only
do this for about five minutes
and then go back to
how this gets used
for the intuitive
physics and planning stuff.
So this is an example from
a paper by Tejas Kulkarni who's
one of our grad students.
And it's joint work with a
few other really smart people
such as Vikash Mansinghka
who's a research
scientist at MIT
and Pushmeet Kohli
who's at Microsoft Research.
And it was a computer vision
paper, pure computer vision
paper from the summer,
where he was developing
a specific kind of probabilistic
programming language,
but a general one for
doing this kind of vision
as inverse graphics
where you could give
a number of different models.
Here I'll show you one for
faces, another one for bodies,
another one for generic objects.
But basically you can pretty
easily specify a graphics model
that when you run it in
the forward direction
generates random images of
objects in a certain class.
And then you can run it
in the reverse direction
to do scene parsing
to go from the image
to the underlying scene.
So here's an example
of this in faces
where the graphics model--
it's really very directly based
on work that Thomas Vetter,
who was a former student
or postdoc of
Tommy's actually, so
kind of an early
ancestor of CBMM,
built with his group in Basel,
Switzerland.
It's a simple but
still pretty nice
graphics model for
making face images.
There's a model of the shape
of the face, which again, it's
like a CAD model.
It's a mesh surface description.
Pretty fine-grained
structure of the 2D surface
of the face in 3D.
And there are about
400 dimensions
to characterize the
possible shapes of faces.
And there's another
400 dimensions
to characterize
the texture, which
is like the skin, the beard,
the eyes, the color, and surface
properties that get
mapped on top of the mesh.
And then there's a little
bit more graphics machinery,
which is generic, not
specific to faces.
That stuff is all
specific to faces.
But then there is a
simple lighting model.
So you basically have
a point light source
somewhere out there and you
shine the light on the face.
It can produce
shadows, of course,
but not very complicated ones.
And then there's a
viewpoint camera thing.
So you put the light source
somewhere and you put a camera
somewhere specifying
the viewpoint.
And the combination of these,
shape, texture, lighting,
and camera, gives you a
complete graphics specification.
It produces an image
of a particular face
lit from a particular
direction and viewed
from some particular
viewpoint and distance.
And what you see on the
right are random samples
from this probabilistic
program, this generative model.
So you can just
write this program
and press Go, Go, Go, Go, Go,
and every time you run it,
you get a new face viewed from
a new direction and lighting
condition.
So that's the prior.
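Here's a minimal sketch of what sampling from such a prior could look like. The 400-dimensional shape and texture codes follow the numbers in the talk, but the distributions and the "renderer" below are invented placeholders, not Vetter's actual morphable model.

```python
import random

def sample_face_scene(rng):
    # Sample every latent of the face model: shape, texture, lighting, camera.
    # The 400-dimensional shape/texture codes match the talk; the
    # distributions here are placeholder Gaussians and uniforms.
    return {
        "shape":   [rng.gauss(0, 1) for _ in range(400)],     # face geometry code
        "texture": [rng.gauss(0, 1) for _ in range(400)],     # skin/eyes/etc. code
        "light":   (rng.uniform(-1, 1), rng.uniform(-1, 1)),  # point light direction
        "camera":  (rng.uniform(-30, 30), rng.uniform(1, 3)), # azimuth, distance
    }

def render_face(scene, size=16):
    # Stand-in renderer: any deterministic map from latents to pixels works
    # to illustrate the prior; this one just mixes the codes into a pixel grid.
    sh, tx = scene["shape"], scene["texture"]
    brightness = 1.0 + scene["light"][0]
    return [[(sh[(i * size + j) % 400] + tx[(j * size + i) % 400]) * brightness
             for j in range(size)] for i in range(size)]

# "Press Go" repeatedly: every run gives a new random face, viewpoint,
# and lighting condition, i.e. samples from the prior.
rng = random.Random(0)
faces = [render_face(sample_face_scene(rng)) for _ in range(3)]
```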
Now, what about inference?
Well, the idea of vision
as inverse graphics
is to say take a real image
of a face like that one
and see if you can
produce from your graphics
model something that
looks like that.
So, for example, here
in the lower left
is an example of a face that
was produced from the graphics
model that hopefully most of you
agree looks kind of like that.
Maybe not exactly the
same, but kind of enough.
And in building this system--
this system, by the
way, is called Picture.
That's the first word
of the paper title,
too, the Kulkarni, et al. paper.
There were a few neat
things that had to be done.
One of the things
that had to be done
was to come up with
various ways to say
what does it mean for the
output of the graphics engine
to look like the image.
In the case of faces,
actually matching up pixels
is not completely crazy.
But for most vision
problems, it's
going to be unrealistic and
unnecessary to build a graphics
engine that's
pixel-level realistic.
And so you might,
for example, want
to have something where the
graphics engine hypothesis is
matched to the image
in terms of
some kind of features.
Like it could be convolutional
neural network features.
That's one way to use, for
example, neural networks
to make something
like this work well.
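Here's a toy sketch of feature-level matching. The quadrant means and gradient energy below are invented stand-ins for real convnet features; the point is only that the likelihood compares summary statistics rather than raw pixels.

```python
def features(image):
    # Stand-in for convnet features: mean intensity in each quadrant plus
    # total gradient energy. A real system might use mid-level CNN activations.
    n = len(image)
    h = n // 2
    feats = []
    for qi in (0, h):
        for qj in (0, h):
            vals = [image[i][j] for i in range(qi, qi + h)
                                for j in range(qj, qj + h)]
            feats.append(sum(vals) / len(vals))
    grad = sum(abs(image[i][j] - image[i][j - 1])
               for i in range(n) for j in range(1, n))
    feats.append(grad / (n * (n - 1)))
    return feats

def feature_log_likelihood(hypothesis, observed, noise=0.5):
    # Score a graphics hypothesis by matching features, not raw pixels:
    # the renderer doesn't have to be photorealistic, just right in the
    # statistics that matter.
    return sum(-0.5 * ((a - b) / noise) ** 2
               for a, b in zip(features(hypothesis), features(observed)))

img = [[float(i == j) for j in range(4)] for i in range(4)]   # a toy "scene"
shifted = [row[1:] + row[:1] for row in img]                  # a wrong hypothesis
```

A perfect match scores zero (the maximum), and a shifted hypothesis scores worse, even though neither comparison ever looks at individual pixel values directly.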
And Jojen just showed me a
paper by some other folks
from Darmstadt,
which is doing what
looks like a very interesting
similar kind of thing.
Let me show what inference looks
like in this model and then
say what I think is an
even more interesting way
to use convolutional networks.
And that's from another recent
paper we've been looking at.
So here is, if you watch this,
this is one observed face.
And what you're seeing
over here is just
a trace of the system
kind of searching
through the space of traces
of the graphics program.
Basically trying
out random faces
that might look like
that face there.
It's using a kind
of MCMC inference.
It's very similar to
what you're going to see
from Tomer in the tutorial.
It basically starts
off with a random face
and takes a bunch of
small random steps
that are biased towards making
the image look more and more
like the actual observed image.
And at the end,
you have something
which looks almost identical
to the observed face.
The key point, though, is that
while the observed face is
literally just a
2D image, the thing
you're seeing on the
right is a projection
of a 3D model of a face.
And it's one that supports
a lot of causal action.
So here just to show you
on a more interesting sort
of high-resolution set of face
images, the ones on the left
are observed images.
And then we fit this model.
And then we can rotate it
around and change the lighting.
If we had parameters that
control the expression--
there are no real expression
parameters here, but that
wouldn't be too hard to put in--
you could make the face happy or sad.
But you can see--
hopefully what you can see
is that the recovered model
supports fairly
reasonable generalization
to other viewpoints and
lighting conditions.
It's the sort of
thing that should
make for more robust
face recognition.
Although that's not
the main focus of what
we're trying to do here.
I just want to emphasize
there's all sorts of things that
would be useful if you had an
actual 3D model of the face you
could get from a single image.
Or here's the same kind of idea
now for a body pose system.
So now, the image
we're going to assume
has a person in it
somewhere doing something.
Remember back to that challenge
I gave at the beginning
about finding the bodies in a
complex scene like the airplane
full of computer
vision researchers
where you found the right
hand or the left toe.
So in order to do
that, we think you
have to have something like
an actual 3D model of a body.
What you see on
the lower left is
a bunch of samples from this.
So we basically just took a kind
of interesting 3D stick figure
skeleton model and just
put some knobs on it.
You can tweak it around.
You can put some simple
probability models
to get a prior.
And these are just
random samples
of random body positions.
And the idea of the system
is to kind of search
through that space
of body positions
until you find one, which
then when you project it
from a certain
camera angle looks
like the body you're seeing.
So here is an example
of this in action.
This is some guy--
I guess Usain Bolt. Some kind
of interesting, slightly unusual
pose as he's about to cross
the finish line maybe.
And here is the
system in action.
So it starts off from
a random position
and, again, sort
of takes a bunch of random steps
moving around in 3D space
until it finds a
configuration, which
when you project it
into the image looks
like what you see there.
Now, notice a key difference
when I say looks like--
it doesn't look like it at the
pixel level like the face did.
It's only matching at the level
of these basically enhanced
edge statistics
which you see here.
So this is an example of
building a model that's
not a photorealistic render.
The graphics model is not
trying to match the image.
It's trying to match this.
Or it could be, for example,
some intermediate level
of convnet features.
And we think this
is very powerful.
Because more generally
while we might
have a really detailed model of
facial appearance, for bodies,
we don't have a
good clothing model.
We're not trying
to model the skin.
We're just trying
to model just enough
to solve the problem
we're interested in.
And again, this is reflective
of a much more broad theme
in this idea of
intelligence as explanation,
modeling the causal
structure of the world.
We don't expect,
even in science,
but certainly not in
our intuitive theories,
to model the causal structure
of the world at full detail.
And a way that either I am
always misunderstood or always
fail to communicate--
it's my fault really--
is I say, oh, we have these
rich models of the world.
People often think
that means we somehow
have the complete thing.
Like if I say we have a
physics engine in our head,
it means we have all of physics.
Or if I say we have
a graphics engine,
we have all of every
possible thing.
This isn't Pixar.
We're not trying to
make a beautiful movie,
except maybe for faces.
We're just trying to capture
just the key parts, just
the key causal parts
of the way things move
in the world as physical
objects and the way images
are formed, at the
right level of abstraction,
the level that matters for us
and allows us to do what we need to do.
This is just an
example of our system
solving some pretty challenging
body pose recognition problems
in 3D, cases which
are problematic
even for the best of standard
computer vision systems.
Either because
it's a weird pose,
like these weird sports
figures, or because the body
is heavily occluded.
But I think, again, these are
problems which people solve
effortlessly.
And I think something
like this is
on the track of
what we want to do.
You can apply the
same kind of thing
to more generic
objects like this,
but I'm not going to
go into the details.
The last thing I want
to say about vision
before getting back to common
sense for a few minutes--
and in some sense, maybe this
is the most important slide
for the broader CBMM, brains,
minds, and machines thing.
Because this is
the clearest thing
I can point to for what
I've been saying all along
since the beginning of
the morning about how
we want to look for ways to
combine the generative model
view and the pattern
recognition view.
So the generative model is
what you see on the left here.
It's the arrows going down.
It's exactly just the
face graphics engine,
the same thing I showed you.
The thing on the right with the
arrows going up is a convnet.
Basically it's an
out-of-the-box, Caffe-style
convolutional neural net with
some fully connected layers
on the top.
And then there's a few
other dashed arrows
which represent linear decoders
from layers of that model
to other things,
which are basically
parts of the generative model.
And the idea here-- this is
work due to Ilker Yildirim, who
some of you might have met.
He was here the other day.
He's one of our CBMM
postdocs, but also
joint with Tejas and with
Winrich who you saw before.
The idea is to try, in
several senses, to combine
the best of these
perspectives, to say, look,
if we want to recognize
anything or perceive
the structure of
the world richly,
I think it needs to be something
like this inverse graphics
or inverting a graphics program.
But you saw how slow it was.
You saw how it took a
couple of seconds at least
on our computer just
for faces to search
through the space
of faces to come up
with a convincing hypothesis.
That's way too slow.
Vision doesn't take you that long.
We know a lot about exactly how
long it takes you from Winrich's,
and Nancy's, and many
other people's work.
So how can vision in this case,
or really much more generally,
be so rich in terms of the
model it builds, yet so fast?
Well, here's a
proposal, which is
to take the things that are good
at being fast like the pattern
recognizers, deep
ones, and train
them to solve the
hard inference problem
or at least to do
most of the work.
It's an idea which is
very heavily inspired
by an older idea
of Geoff Hinton's
sometimes called the
Helmholtz machine.
Here the idea in
common with Hinton
is to have a generative
model and a recognition model
where the recognition
model is a neural network
and it's trained to invert
the generative model.
Namely, it's trained to map not
from sense data to task output,
but from sense data to
the hidden deep causes
of the generative model, which
then, when you want to use this
to act to plan what you're going
to do, you plan on the model.
To make an analogy to say the
DeepMind video game player,
this would be like
having a system which,
in contrast to the
DeepQ network, which
mapped from pixel images
to joystick commands,
this would be like
learning a network that
maps from pixel images
to the game state,
to the objects, the sprites
that are moving around,
the score, and so on,
and then plans on that.
And I think that's much more
like what people do.
Here just in the
limited case of faces,
what are we doing, right?
So what we've got
here is we take
this convolutional
neural network.
We train it in ways that you
can read about in the paper.
It's a very easy kind of training
to basically make predictions,
to make guesses
about all the latent
variables, the shape, the
texture, the lighting,
the camera angle.
And then you take those
guesses, and they start off
that Markov chain.
So instead of starting off at
a random graphics hypothesis,
you start off at
a pretty good one
and then refine it a little bit.
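Here's a deliberately tiny sketch of that Helmholtz-machine-style scheme. One scalar latent and a linear fit stand in for the face latents and the convolutional network; the generative model, its "pixels," and all the numbers are invented for illustration.

```python
import math
import random

def render(z):
    # Toy generative model: one latent scene variable -> three "pixels".
    return [z, 2 * z + 1, -0.5 * z]

def train_recognition(n=200, seed=0):
    # The Helmholtz-machine move: generate (latent, image) pairs from the
    # model itself, then fit a bottom-up map image -> latent. A simple
    # linear fit stands in here for the convolutional network.
    rng = random.Random(seed)
    zs = [rng.gauss(0, 2) for _ in range(n)]
    xs = [render(z)[0] for z in zs]
    slope = sum(x * z for x, z in zip(xs, zs)) / sum(x * x for x in xs)
    return lambda image: slope * image[0]

def mcmc_refine(observed, z0, steps=200, noise=0.05, seed=1):
    # A short Markov chain that starts from the recognition guess
    # instead of a random hypothesis, then polishes it.
    rng = random.Random(seed)
    def ll(z):
        return sum(-0.5 * ((r - o) / noise) ** 2
                   for r, o in zip(render(z), observed))
    z, score = z0, ll(z0)
    for _ in range(steps):
        prop = z + rng.gauss(0, 0.1)
        s = ll(prop)
        if math.log(rng.random()) < s - score:
            z, score = prop, s
    return z

recognize = train_recognition()
observed = render(1.5)                    # image of an unknown scene
z_fast = recognize(observed)              # one bottom-up pass: almost right
z_full = mcmc_refine(observed, z_fast)    # a few top-down steps refine it
```

The design point is the division of labor: the trained bottom-up pass lands near the right hypothesis in one shot, so the expensive top-down search only has to refine, not explore.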
What you can see
here in these blue
and red curves is the blue
curve is the course of inference
for the model I
showed you before,
where you start off
at a random guess,
and after, I don't know, 100
iterations of MCMC, you improve
and you kind of get there.
Whereas the red
curve is what you
see if you start off with
the guess of this recognition
model.
And you can see that
you start off sort
of in some sense almost as good
as you're ever going to get,
and then you refine it.
Well, it might look
like we're just
refining it a little bit.
But this is a kind of
double log scale:
a log plot of
log probability.
So what looks like a little
bit there on the red curve
is actually a lot--
I mean perceptually.
You can see it here where
if you take-- on the top
I'm showing observed
input faces.
On the bottom I'm
showing the result
of this full inverse
graphics thing.
And they should look
almost identical.
So the full model is able to
basically perfectly invert this
and come up with a
face that really does
look like the one on the top.
The ones in the middle
are the best guess
you get from this
neural network that's
been trained to approximately
invert the generative model.
And what you can see
is on first glance
it should look pretty good.
But if you pay a little
bit of attention,
you can see differences.
Like hopefully you can see that
this person is not actually
that person, in the way that
this one much more convincingly is.
Or this person-- this
one is pretty good,
but I think this one--
I think it's pretty
easy to say, yeah,
this isn't quite the
same person as that one.
Do you guys agree?
We've done some
experiments to verify this.
But hopefully they should
look pretty similar,
and that's the point.
How do you combine the best of
these computational paradigms?
How can perception
more generally
be so rich and so fast?
Well, quite possibly like this.
It even actually might
provide some insight
into the neural circuitry
that Winrich and Doris Tsao
and others have mapped out.
We think that this
recognition model that's
trained to invert
the graphics model
can provide a really nice
account of some of Winrich's
data like you saw before.
But I will not go
into the details
because in maybe
five to 10 minutes
I want to get back to
physics and psychology.
So physics-- and there won't
be any more neural networks.
Because that's about as much--
I mean, I think we'd like to
take those ways of integrating
the best of these
approaches and apply them
to these more general cases.
But that's about as
far as we can get.
Here what I want to just
give you a taste of at least
is how we're using
ideas just purely
from probabilistic
programs to capture
more of this common sense
physics and psychology.
So let's say we can solve
this problem by making
a good guess of
the 3D world state
from the image very quickly
inverting this graphics engine.
Now, we can start to do
some physical reasoning,
a la Craik's mental model in
the head of the physical world,
where we now take a
physics engine.
Here again we're using the
kind of physics engines
from game physics--
very simple ones--
again, I don't have time
to go into the details.
Although Tomer has written a
very nice paper with, well,
with himself.
But he's nicely put my
name and Liz's on it--
about sort of
trying to introduce
some of the basic
game engine concepts
to cognitive scientists.
So hopefully we'll be able
to show you that soon too.
Or you can read about them.
Basically it's that these
physics engines are just
doing again a very quick, fast,
approximate implementation
of certain aspects of
Newtonian mechanics.
Sufficient that if
you run it a few time
steps with a
configuration of objects
like that, you
might get something
like what you see over
there on the right.
That's an example of running
this approximate Newtonian
physics forward
a few time steps.
Here's another sample from
this model, another kind
of mental simulation.
We take a slightly different
guess of the world state,
and we run that forward
a few time steps,
and you see something
else happens.
Nothing here is claimed to be
accurate in the ground truth
way.
Neither one of these is
exactly the right configuration
of blocks.
And you run this thing forward,
and it only approximately
captures the way blocks
really bounce off each other.
It's a hard problem to actually
totally realistically simulate.
But our point is that
you don't really have to.
You just have to make
a reasonable guess
of the position of the
blocks and a reasonable guess
of what's going to happen a
few time steps in the future
to predict what you need to
know in common sense, which
is that, wow, that's
going to fall over.
I better do something about it.
And that's what our
experiment taps into.
We give people a
whole bunch of stimuli
like the ones I showed
you and ask them,
on some graded
scale, how likely do
you think it is to fall over?
And what you see here--
this is again one of those plots
that are always the same, where
on the y-axis are the average
human judgments-- now it's
an estimate of how
unstable the tower is.
It's both the probability
that it will fall,
but also how much of
the tower will fall.
So it's like the
expected proportion
of the tower that's going
to fall over under gravity.
And along the x-axis is
the model prediction,
which is just the average
of a few samples from what
I showed you.
You just take a few
guesses of the world state,
run it forward a few
time steps, count up
the proportion of blocks
that fell, and average that.
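Here's a toy sketch of that model prediction. The topple rule below is a crude invented stand-in for a real game physics engine, and the towers, noise level, and block width are all made up, but the structure matches: sample noisy guesses of the world state, simulate forward, and average the proportion that fell.

```python
import random

def simulate_falls(xs, width=1.0):
    # Crude stand-in for a game physics engine: a block falls if its center
    # overhangs the edge of the block below; everything above it falls too.
    # xs are the horizontal centers of the blocks, bottom to top.
    for i in range(1, len(xs)):
        if abs(xs[i] - xs[i - 1]) > width / 2:
            return (len(xs) - i) / len(xs)  # proportion of the tower that fell
    return 0.0

def judged_instability(xs, n_samples=100, noise=0.15, seed=0):
    # The model prediction: take noisy guesses of the block positions,
    # run approximate physics forward, and average what fell.
    rng = random.Random(seed)
    runs = [simulate_falls([x + rng.gauss(0, noise) for x in xs])
            for _ in range(n_samples)]
    return sum(runs) / len(runs)

neat_tower    = [0.0, 0.0, 0.0, 0.0]   # blocks stacked squarely
leaning_tower = [0.0, 0.3, 0.6, 0.9]   # staggered, but each offset < width/2
```

Note that `simulate_falls(leaning_tower)` is exactly zero: under noiseless physics with the true positions, the staggered tower is perfectly stable. Yet under perceptual noise the model, like people, judges it somewhat unstable, which is the same character as the physics illusions where carefully balanced stacks look like they must fall.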
And what you can
see is that it does
a really nice job of predicting
people's stability intuitions.
I'll just point to an
interesting comparison,
because it gets at where
the probability comes in
in these probabilistic programs.
Well, here's one
very noticeable way.
So if you look down
there on the lower right,
you'll see a smaller
version of a similar plot.
It's plotting now
the results of--
it says ground truth physics,
but that's a little misleading
maybe.
It's just a noiseless
physics engine.
So we take the
same physics model,
but we get rid of any of
the state uncertainties.
So we tell it the true
position of the blocks,
and we give it the true physics.
Whereas our probabilistic
physics engine
allows for some uncertainty
in exactly which forces
are doing what.
But here we say we're just going
to model gravity, friction,
collisions as best we can.
And we're going to get the
state of the blocks perfectly.
And because it's noiseless,
you notice that--
so those crosses over
there are crosses
because they're error bars,
both across people and model
simulations.
Now they're just vertical lines.
There are no error bars
in the model simulation
because it's deterministic.
It's graded because there's the
proportion of the tower that
falls over.
But what you see is the
model is a lot worse.
It scatters much more.
The correlation
of the model
with people's judgments
dropped from around 0.9
to around 0.6.
And you have some cases
like this red dot here--
that corresponds
to this stimulus--
which goes from being a
really nice model fit.
This is one which people
judged to be very unstable,
and so does the
probabilistic physics engine.
But actually it's
not unstable at all.
It's actually perfectly stable.
The blocks are actually
just perfectly balanced
so that it doesn't fall.
Although I'm sure
everybody looks at that
and finds that hard to believe.
So this is nice.
This is a kind of
physics illusion.
There are real world versions
of this out on the beaches
not too far from here.
It's a fun thing to do to
stack up objects in ways
that are surprisingly stable.
We'd call it a surprise
because your intuitive physics
has a certain irreducible noise.
What we're suggesting here is
that your physical intuitions--
you're always in
some sense making
a guess that's sensitive to
the uncertainty about where
things might be and what forces
might be active on the world.
And it's very hard to see
these as deterministic physics,
even when you know that
that's exactly what's going on
and that it is stable.
Let me say just a little
bit about planning.
So how might you use
this kind of model
to build some model of this
core intuitive psychology?
And I don't mean here
all of theory of mind.
Next week, we'll
hear a lot more.
Like Rebecca Saxe
will be down here.
We'll hear a lot more about
much richer kinds of reasoning
about other people's
mental states
that adults and older
children can do.
But here we're
talking about, just
as we were talking
about what I was calling
core intuitive physics, again
inspired by Liz's work, just
what objects do right
here on the table top around us
over short time scales,
the core theory of mind,
something that even very young
babies can do in some form,
or at least young children.
There's controversy over
exactly what age kids are
able to do this sort of thing.
But in some form I
think before language,
it's the kind of thing that when
you're starting to learn verbs,
the earliest language
is kind of mentalistic
and builds on this knowledge.
And take the red and blue ball
chasing scene that you saw,
remember from Tomer.
That was 13-month-olds.
So there's definitely some
form of kind of interpretation
of beliefs and desires
in some protoform
that you can see even in infants
of around one year of age.
And it's exactly that
kind of thing also.
Remember that, if you saw John
Leonard's talk yesterday--
he was the robotics guy who
talked about self-driving cars
and how there's certain
gaps in what they
can do despite
all the publicity,
like they can't
turn left, basically,
at an unrestricted intersection.
Because there's a
certain kind of theory
of mind in street scenes when
cars could be coming and people
could be crossing
or all those things
about the police officers.
Part of why this is
so exciting to me
and why I love that talk is
because this is, I think,
that same common sense
knowledge that if we can really
figure out how to capture
this reasoning about beliefs
and desires in the
limited context
where desires are people moving
around in space around us
and the beliefs
are who can see who
and who can see
who can see who--
in driving, the art of making
eye contact with other drivers
or pedestrians is seeing
that they can see you
or that they can
see what you can see
and that they can see
you seeing them.
It doesn't have to be
super deeply recursive,
but it's a couple
of layers deep.
We don't have to think
about it consciously,
but we have to be able to do it.
So that's the kind
of core belief desire
theory of mind reasoning.
And here's how we've
tried to capture this
with probabilistic programs.
This is work that Chris Baker
started doing a few years ago.
And a lot of it is
joint with Rebecca Saxe
and also some of it with Julian
Jara-Ettinger and some of it
with Tomer.
So there's a whole
bunch of us who've
been working on
versions of this,
but I'll just show you
one or two examples.
Again, the key programs here
are not graphics or physics
engines, but planning engines
and perception engines.
So very simple kinds
of robotics programs,
far too simple in
this form to build
a self-driving car
or a humanoid robot,
but maybe the kind of thing
that game robots, like the zombie
or the security guard
in Quake or something,
might do.
So planning here basically
means a little bit more
than sort of shortest
path planning.
But it's basically like
find a sequence of actions
in a simple world
like moving around
a 2D environment that maximizes
your long run expected reward.
So there's a kind
of utility theory,
or what Laura Schulz calls a
naive utility calculus, here.
A calculation of costs and
benefits where in a sense
you get a big reward,
a good positive utility
for getting to your goal and a
small cost for each action you
take.
And under that view,
then in some sense,
some actions might be more
costly than others, something
that Tomer is looking at
in infants and something
that Julian
Jara-Ettinger has looked
at in older kids, this
understanding that actions have different costs.
But this sort of basic
cost-benefit trade
off that is going on whenever
you move around an environment
and decide, well, is it
worthwhile to go all the way
over there, or, well, I know
I like the coffee up at Pie
in the Sky better than the
coffee in the dining hall
here at Swope.
But then I have to think, am I
going to be late to my lecture?
Am I going to be late
to Nancy's lecture?
Those are different costs--
both costs.
It's that kind of calculation.
So here let me
get more concrete.
So here's an example
of an experiment
that Chris did a few years
ago where, again, it's
like what you saw what the
Heider and Simmel, the squares
and the triangles and
circles or the south gate
and chibra, the red and blue
balls chasing each other.
Very simple stuff.
Here you see an agent.
It's like an overhead
view of a room,
2D environment from the top.
The agent's moving
along some path.
There are three possible
goals, A, B, or C.
And then there's maybe some
obstacles or constraints,
like a wall like
you saw in those movies.
Maybe the wall has a hole
that he can pass through.
Maybe it doesn't.
And across different
trials of the experiment,
just like in the physics
stuff we vary all the block
configurations and so on, here
we vary where the goals are.
We vary whether the
wall has a hole or not.
We vary the agent's path.
On different trials, we also
stop it at different points.
Because we're trying to
see as you watch this agent
move around, action
unfolds over time.
How do your guesses about
his goal change over time?
And what you see--
so these are just examples
of a few of the scenes.
And here what you see
are examples of the data.
Again, the y-axis is the
average human judgment.
Red, blue, and green is
color coded to the goal.
They're just asked,
how likely do
you think each of those
three things is his goal?
And then here the
x-axis is time.
So these are time steps that
we ask at different points
along the trajectory.
And what you can
see is that people
are making various systematic
kinds of judgments.
Sometimes they're not sure
whether his goal is A or B,
but they know it's
not C. And then
after a little while,
some key step happens,
and now they're quite
sure it's A and not B.
Or they could change their mind.
Here people were pretty sure it
was either green or red but not
blue.
And then there comes a point
where it's surely not green,
but it might be blue or red.
Oh no, then it's red.
Here they were pretty
sure it was green.
Then no, definitely not green.
And now, I think it's red.
It was probably never blue.
OK.
And the really
striking thing to us
is how closely you can
match those judgments
with this very simple
probabilistic planning
program run in reverse.
So we take, again, this
simple planning program
that just says basically just
kind of get as efficiently
as possible to your goal.
I don't know what
your goal is though.
I observe your actions that
result from an efficient plan,
and I want to work
backwards to say,
what do I think your
goal is, your desire,
the rewarding state?
And just doing
that just basically
perfectly predicts
people's data.
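Here's a minimal sketch of that inverse-planning computation. The grid layout, goal positions, and the Boltzmann-style "noisily rational" policy are invented for illustration; the actual models in the papers are richer, but the logic of running a planner in reverse with Bayes' rule is the same.

```python
import math

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def dist(p, g):
    # Manhattan distance: the "cost to go" in a simple grid world.
    return abs(p[0] - g[0]) + abs(p[1] - g[1])

def step_log_prob(pos, nxt, goal, beta=2.0):
    # A noisily rational planner: each step prefers moves that get it
    # closer to its goal (softmax over negative distance-to-go).
    scores = [math.exp(-beta * dist((pos[0] + dx, pos[1] + dy), goal))
              for dx, dy in MOVES]
    return math.log(math.exp(-beta * dist(nxt, goal)) / sum(scores))

def goal_posterior(path, goals):
    # Run the planner in reverse: P(goal | observed path) by Bayes' rule,
    # with a uniform prior over the candidate goals.
    logps = [sum(step_log_prob(path[i], path[i + 1], g)
                 for i in range(len(path) - 1)) for g in goals]
    m = max(logps)
    ws = [math.exp(lp - m) for lp in logps]
    return [w / sum(ws) for w in ws]

goals = [(5, 0), (5, 5), (0, 5)]          # candidate goals A, B, C
path = [(0, 0), (1, 0), (2, 0), (3, 0)]   # agent walks right along the bottom
post = goal_posterior(path, goals)
# Walking right is ideal for A, merely consistent with B, and wrong for C,
# so the posterior orders them A > B > C, mirroring the graded human curves.
```

As the path extends, re-running `goal_posterior` on each prefix reproduces the time-course judgments in the experiment: early ambiguity between nearby goals, then a sharp commitment once a disambiguating step is observed.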
I mean, of all the
mathematical models of behavior
I've ever had a
hand in building,
this one works the best.
It's really quite striking.
To me it was striking
because I came
in thinking this would be a
very high-level, weird, flaky,
hard-to-model thing.
Here's just one
more example of one
of these things, which
actually puts beliefs in there,
not just desires.
So it's a key part of
intuitive psychology
that we do joint inference
over beliefs and desires.
In this one here, we assume
that you, the subject,
the agent who's moving
around, all of us
have shared full
knowledge of the world.
So we know where
the objects are.
We know where the holes are.
There's none of
this false belief,
like you think something
is there when it isn't.
Now, here's some
later work that Chris
did, what we call the
food truck studies,
where here we add in some
uncertainty about beliefs
in addition to desires.
And it's easiest just to
explain with this one example
up there in the upper left.
So here, like on a lot
of university campuses,
lunch is best found
at food trucks,
which can park in different
spots around campus.
Here the two yellow squares
show the two parking spots
on this part of campus.
And there are several
different trucks
that can come and park
in different places
on different days.
There's a Korean truck.
That's k.
There's a Lebanese truck.
That's l.
There's also other trucks
like a Mexican truck.
But there's only two spots.
So if the Korean one parks
there and the Lebanese one
parks there, the Mexican
has to go somewhere else
or can't come there today.
And on some days the trucks
park in different places.
Or a spot could
also be unoccupied.
The trucks could be elsewhere.
So look at what
happens on this day.
Our friendly grad
student, Harold,
comes out from his office here.
And importantly, the way we
model interesting notions
of evolving belief
is that now we've
got that perception and
inference arrow there.
So Harold forms his
belief about what's
where based on what he can see.
And it's just the simplest
perception model, just
line-of-sight access.
We assume he can kind of see
anything that's unobstructed
in his line of sight.
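Here's a toy sketch of that line-of-sight perception model. The grid coordinates, the building, and the spot names are invented stand-ins for the food-truck layout; visibility is checked by densely sampling the segment between the agent and the target.

```python
def visible(agent, target, walls, samples=64):
    # The simplest perception model: line-of-sight access. The target is
    # seen iff the straight segment from agent to target doesn't pass
    # through any wall cell (checked by dense sampling along the segment).
    ax, ay = agent
    tx, ty = target
    for k in range(1, samples):
        t = k / samples
        cell = (round(ax + t * (tx - ax)), round(ay + t * (ty - ay)))
        if cell in walls:
            return False
    return True

# A toy version of the campus layout (all coordinates invented):
walls = {(2, 0), (2, 1), (2, 2)}    # the building in the middle
harold = (0, 1)                     # where he comes out of his office
near_spot = (1, 1)                  # parking spot in plain view
far_spot = (4, 1)                   # spot hidden behind the building

belief = {spot: ("seen" if visible(harold, spot, walls) else "unknown")
          for spot in (near_spot, far_spot)}
```

So from his starting position Harold's belief covers the near spot but leaves the far spot unknown, which is exactly the gap his prior (and his detour around the building) has to fill in.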
So that means that
when he comes out here,
he can see that there is
the Korean truck here.
But he can't see past this--
this is a wall or a building.
He can't see what's on
the other side of it.
OK, so what does he do?
Well, he walks down here.
He goes past the Korean
truck, goes around
the other side of the building.
Now at this point,
his line of sight
shows him that there
is a Lebanese truck there.
He turns around, and he goes
back to the Korean truck.
So the question for you is,
what is his favorite truck?
Is it Korean,
Lebanese, or Mexican?
AUDIENCE: Mexican.
PROFESSOR: Mexican,
yeah, it doesn't sound
very hard to figure that out.
But it's quite interesting
because the Mexican one
isn't even in the scene.
The most basic kind of goal
recognition-- and this,
again, cuts right to the
heart of the difference
between recognition
and explanation.
There's been a lot of
progress in machine vision
systems for action
understanding, action
recognition, and so on.
And they do things like, for
example, they take video.
And the best cue that
somebody wants something
is if they reach for
it or move towards it.
And that's certainly
what was going on here.
In all of these scenes,
your best inference
about the guy's goal
is which thing
he is moving towards.
And it's just
subtle to parse out
the relative degrees
of confidence
when there's a complex
environment with constraints.
But in every case,
by the end it's
clear he's going for
one thing, and the thing
he is moving towards
is the thing he wants.
But here you have
no trouble realizing
that his goal is
something that isn't
even present in the scene.
Yet he's still
moving towards it.
In a sense, he's moving towards
his mental representation
of it.
He's moving towards the Mexican
truck in his mind's model.
And that's him explaining
the data he sees.
For some reason,
he must have had
maybe a prior belief that the
Mexican truck would be there.
So he formed a plan to go there.
And in fact, we can
ask people not only
which truck he likes--
it's the Mexican truck.
That's what people say,
and here is the model.
But we also asked them
a belief inference.
We say, prior to
setting out, what
did Harold think was
on the other side?
What was parked in the other
spot that he couldn't see?
Did he think it was Lebanese,
Mexican, or neither?
And we ask a degree of belief.
So you could say he had no idea.
But interestingly,
people say he probably
thought it was Mexican.
Because how else could you
explain what he's doing?
So I mean, if I had to point to
just one example of cognition
as explanation, it's this.
The only sensible way, and it's
a very intuitive and compelling
way to explain why did
he go the way he did
and then turn around
just when he did
and wind up just where he did,
is this set of six inferences,
basically.
That his favorite is
Mexican, his second favorite
is Korean-- that's also
important-- his least
favorite is Lebanese.
And he thought that
Mexican was there,
which is why it was
worthwhile to go and check.
At least, he thought
it was likely.
He wasn't sure, right?
Notice it's not very high.
But it's more likely than
the other possibilities.
Because, of course, if he was
quite sure it was Lebanese,
well, he wouldn't have
bothered to go around there.
And in fact, you do see that.
So you have ones--
I guess I don't have them here.
But there are scenes where
he just goes straight here.
And then that's consistent
with him thinking possibly it
was Lebanese.
And if he thought
nothing was there,
well, again, he wouldn't
have gone to check.
And again, this model is
extremely quantitatively
predictive of people's judgments
about both desires and beliefs.
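The kind of joint belief-desire inference described above can be caricatured as a tiny Bayesian inverse planner. Everything below (the hypothesis space, the noise level `EPS`, the hard-coded planner for this one scenario) is an illustrative toy, not the actual model from the study:

```python
# Toy inverse planning for the food-truck scenario: jointly infer the
# agent's preference ranking and his prior belief about the hidden spot
# from the observed path (walked around the wall, then returned to the
# Korean truck).
import itertools

TRUCKS = ['K', 'L', 'M']       # Korean, Lebanese, Mexican
rankings = list(itertools.permutations(TRUCKS))
beliefs = ['L', 'M', None]     # what he thinks is in the hidden spot
EPS = 0.05                     # unpredicted behavior is unlikely, not impossible

def predict(ranking, belief):
    """Deterministic planner: check the hidden spot only if the believed
    truck beats the visible Korean one; then eat at the better of the
    trucks he has actually seen (Korean vs. the Lebanese one he finds)."""
    util = {t: len(ranking) - ranking.index(t) for t in TRUCKS}
    checks = belief is not None and util[belief] > util['K']
    final = ('K' if util['K'] > util['L'] else 'L') if checks else 'K'
    return checks, final

observed = (True, 'K')         # he went around, then came back to Korean

posterior = {}
for r in rankings:
    for b in beliefs:
        posterior[(r, b)] = 1.0 if predict(r, b) == observed else EPS
Z = sum(posterior.values())

favorite = {t: 0.0 for t in TRUCKS}
belief_post = {'L': 0.0, 'M': 0.0, 'none': 0.0}
for (r, b), lik in posterior.items():
    favorite[r[0]] += lik / Z
    belief_post[b or 'none'] += lik / Z

print(favorite)     # Mexican gets the most mass, though it is off-screen
print(belief_post)  # and he probably believed Mexican was parked there
```

Only the hypothesis "he ranks Mexican over Korean over Lebanese, and believed the Mexican truck was in the hidden spot" predicts the detour exactly, so both the desire and the belief posteriors concentrate on Mexican.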
You can read in some
of Battaglia's papers
ways in which you take the
very same physics engine
and use it for all these
different tasks, including
sort of slightly weird
ones like these tasks.
If you bump the
table, are you more
likely to knock off red
blocks or yellow blocks?
Not a task you ever got any
end-to-end training on, right?
But an example of
the compositionality
of your model and your task.
Somebody asked me
this during lunch,
and I think it is a key point
to make about compositionality.
One of the key ways in
which compositionality
works in this view of the
mind, as opposed to the pattern
recognition view
or the way, let's
say, like a deep
Q-network works--
AUDIENCE: You mean
the [INAUDIBLE].
PROFESSOR: Just ways of getting
a very flexible repertoire
of inferences from composing
pieces without having
to train specifically for it.
It's that if you have
a physics engine,
you can simulate
the physical world.
You can answer questions
that you've never gotten
any training at all to solve.
So in this experiment
here, we ask
people, if you bump
the table hard enough
to knock some of the
blocks onto the floor,
is it more likely to be
red or yellow blocks?
Unlike questions of whether
this tower will fall over--
we've all made a lot
of judgments of that sort--
you've never made that
kind of judgment before.
It's a slightly weird one.
But you have no
trouble making it.
And for many different
configurations of blocks,
you make variously
graded judgments,
and the model captures it
perfectly with no extra stuff
put in.
You just
take the same model,
and you ask it a
different question.
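That red-versus-yellow question can be caricatured in a few lines of Monte Carlo simulation, with a one-dimensional stand-in for the physics engine. The block positions, bump strength, and noise level are all made up for illustration:

```python
# Toy Monte Carlo version of "bump the table: red or yellow blocks?"
# Each block sits at some distance from the table edge; a noisy bump
# shoves everything toward it, and a block falls if the shove exceeds
# its distance.
import random

random.seed(0)

# (color, distance from the table edge): red blocks sit nearer the edge
blocks = [('red', 0.2), ('red', 0.4), ('yellow', 1.4), ('yellow', 1.8)]

def bump_once(strength=1.0, noise=0.5):
    """One noisy bump; returns how many blocks of each color fell."""
    fallen = {'red': 0, 'yellow': 0}
    for color, dist in blocks:
        if random.gauss(strength, noise) > dist:
            fallen[color] += 1
    return fallen

N = 5000
totals = {'red': 0, 'yellow': 0}
for _ in range(N):
    f = bump_once()
    totals['red'] += f['red']
    totals['yellow'] += f['yellow']

print(totals)  # red blocks, being nearer the edge, fall far more often
```

The point is compositional: nothing here was trained on the red-versus-yellow question. You just run the same simulator and count a different statistic.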
So if our dream is to
build AI systems that
can answer questions,
for example, which
is a lot of people's
dream, I think
there's really no
compelling alternative
to something like this.
That you build a model that
you can ask all the questions
of that you'd want to ask.
And in this limited domain,
again, it's just our Atari.
In this limited
domain of reasoning
about the physics of blocks,
it's really pretty cool
what this physics
engine is able to do
with many kinds of questions.
It can reason about things
with different masses.
It can make guesses
about the masses.
You can make some of the
objects bigger or smaller.
You can attach constraints
like fences to the table.
And the same model, without
any fundamental change,
can answer all these questions.
So it doesn't have
to be retrained.
Because there's
basically no training.
It's just reasoning.
If we want to understand
how learning works,
we first have to
understand what's learned.
I think right now,
we're only at the point
where we're starting
to really have
a sense of what are these mental
models of the physical world
and intentional action--
these probabilistic programs
that even young children
are using to reason
about the world.
And then it's a
separate question
how those are built up
through some combination
of scientific discovery sorts
of processes and evolution.
So here's the story,
and I've told most
of what I want to tell you.
But the rest you'll
get to hear--
some of it you'll
get to hear next week
from both our
developmental colleagues
and from me and Tomer,
more on the computational side.
But actually the
most interesting part
we just don't know yet.
So we hope you will actually
write that next chapter
of this story.
But here's the outlines of
where we currently see things.
We think that we have a good
target for what is really
the core of human intelligence,
what makes us so smart in terms
of these ideas of
both what we start
with, this common sense
core physics and psychology,
and how those things grow.
What are the learning
mechanisms that grow them?
Again, more next week on the
sort of science-like mechanisms
of hypothesis formation,
experiment testing, play,
exploration that you can use to
build these intuitive theories,
much like scientists build
their scientific theories.
And that we're starting
on the engineering side
to have tools to capture this,
both to capture the knowledge
and how it might
grow through the use
of probabilistic
programs and things
that sometimes go by the name
of program induction or program
synthesis.
Or, if you like,
it's hierarchical Bayes
on programs that generate
other programs, where
the search for a good program is
like the inference of a program
that best explains the data, as
generated from a prior that's
a higher-level program.
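As a toy illustration of program induction with a simplicity prior, here is a sketch that searches a tiny made-up grammar of arithmetic programs for the one that best explains some data. The grammar, the data, and the scoring rule are all illustrative, not from the lecture:

```python
# Toy program induction: enumerate small arithmetic programs from a
# grammar, then pick the one that best explains the data under a prior
# that favors shorter programs (a description-length prior).
import math

leaves = ['x', 0, 1, 2, 3]

def grow(exprs):
    """One round of composition: combine existing expressions with + and *."""
    out = list(exprs)
    for a in exprs:
        for b in exprs:
            out.append(('+', a, b))
            out.append(('*', a, b))
    return out

def ev(e, x):
    """Evaluate an expression tree at input x."""
    if e == 'x':
        return x
    if isinstance(e, int):
        return e
    op, a, b = e
    va, vb = ev(a, x), ev(b, x)
    return va + vb if op == '+' else va * vb

def size(e):
    """Description length: number of grammar symbols used."""
    if e == 'x' or isinstance(e, int):
        return 1
    return 1 + size(e[1]) + size(e[2])

data = [(0, 1), (1, 3), (2, 5)]   # generated by y = 2x + 1

def score(e):
    """Log-posterior up to a constant: a hard likelihood (must fit every
    data point) plus the description-length prior."""
    if any(ev(e, x) != y for x, y in data):
        return -math.inf
    return -size(e)

programs = grow(grow(leaves))     # all programs up to two compositions
best = max(programs, key=score)
print(best)  # a minimal program computing 2x + 1
```

Swapping in a different grammar or dataset changes nothing structural: search over programs plus a simplicity prior is the whole recipe.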
If you go to the
tutorial from Tomer,
you'll actually see
the building blocks.
You can write
Church programs that
will do something
like that, and we will
see more of that next time.
But the key is that
we have a language now
which keeps building the
different ingredients that we
think we need.
On the one hand, we've gone from
thinking that we need something
like probabilistic generative
models, which many people will
agree with, to
recognizing that not only
do they have to be
generative, they have
to be causal and compositional.
And they have to have this
fine-grained compositional
structure needed to capture
the real stuff of the world.
Not graphs, but something
more like equations
that capture graphics
or physics or planning.
Of course, that's not all.
I mean, as I tried
to gesture at,
we need also ways to make these
things work very, very quickly.
There might be a place in
this picture for something
like neural networks
or some kind
of alternative, complementary
approach based
on pattern recognition.
But these are just
a number of the ways
which I think we need to
think about going forward.
We need to take the idea of both
the brain as a pattern recognition
engine seriously and the idea
of the brain as a modeling
or explanation engine seriously.
We're excited
because we now have
tools to model modeling
engines and maybe
to model how pattern
recognition engines and modeling
engines might interact.
But really, again,
the great challenges
here are really very
much in our future.
Not the unforeseeable future,
but the foreseeable one.
So help us work on it.
Thanks.
