Professor Anna Choromanska : Welcome to the first talk of 2019 seminar series on modern artificial
intelligence at the NYU Tandon School
of Engineering. This series, which we launched last year, aims to bring together faculty, students, and researchers to discuss the most important trends in the world of AI, and this year's speakers, like last year's, are world-renowned experts, whose
research is making an immense impact on
the development of new machine learning
techniques and technologies. In the
process, they are helping to build a better, smarter, more connected world, and we are proud to have them here. By the way, the talks are livestreamed and viewed around the globe, helping to spread the word about the amazing work going on in the AI community. I would
like to thank Jelena Kovačević, as well as my own Department of Electrical and Computer Engineering, for supporting this series and graciously hosting our
esteemed speakers. As many of you may already know, Dean Kovačević came to us from Carnegie Mellon University in Pittsburgh, and she has that in common with today's speaker, Martial Hebert, who heads the Robotics Institute at Carnegie Mellon. Just like Brooklyn, Pittsburgh has
a storied history that captures the imagination. Many people know it as the Steel City because of its status as a hub of industry and steel production; today, thanks in some part to the efforts of Martial and his fellow researchers, it has another nickname: Roboburgh.
Today more than 60 robotics companies
make their home in Allegheny County, many
as products of the strong robotics
program at CMU. Martial's own research interests include computer vision and robotics, especially recognition in
images and video data, model building and object recognition from 3D data, and perception for mobile robots and for intelligent vehicles. His group has developed approaches for object recognition and scene analysis in images, 3D point clouds, and video
sequences. In the area of machine perception for robotics, his group has developed techniques for people detection, tracking, and prediction, and for understanding the environment of land vehicles from sensor data. He also
currently serves as editor-in-chief of the International Journal of Computer Vision. Martial will be speaking to us today about the formidable research challenges in using computer vision in robotics systems, illustrating his presentation with compelling examples from autonomous air and ground robots. I
know none of us can wait to hear him, so
without further ado I'll invite him to
the stage. [Martial Hebert:] Thank you... great to be here on this campus. Can you hear me? Yes, obviously you can hear me in the back, right... All right, so I'm going to talk to
you a little bit about computer vision
and more specifically about the issues
that we face in using computer vision in
autonomous systems, or more generally robotic systems. The story here
is that we've seen extremely rapid
progress over the past few years in
computer vision, fueled in part by
progress in machine learning, deep
learning, and other developments. However, progress has been much slower in moving those results to autonomous systems, robotic systems, and what I'd like
to discuss is a few of the reasons why
that is and a few of the research areas,
if you will, that we need to address to
accelerate that connection between the
progress in computer vision and
autonomous system. So when I talk about autonomous systems, I mean things like this: this is a video from some work from some years ago of a drone flying through a dense, forested area like this. The drone has input only from a single camera and has to generate, in real time, the steering command, the left/right steering command here. This is another example
here from another group at CMU, the Robotics Institute; this is 3D mapping of an environment, in the context of safe driving. For example, a car driving on the street: you see the speed here; if you do the calculation, that's about 100 kilometers an hour in the streets of Pittsburgh. We're not supposed to do that... and unfortunately I'm being streamed, so that's bad, but okay. This is a
real application; this is from a group at the National Robotics Engineering Center, at the Robotics Institute, and this is a collaboration with Caterpillar. Those trucks have been operating for many years now and have driven
thousands of miles; those are fully autonomous, okay, there's no human interaction here. So this is basically one example of the self-driving type of thing. This is a slightly simpler environment than open-world self-driving, because of course you don't have other agents and so forth, but that gives you an idea of what we try to do in robotics and in autonomous systems. This is another example; this is
from Sebastian Scherer's group, and as you're seeing, this is a fully autonomous helicopter. Now you have to deal with, you know, everything from bad sensor data to complicated dynamics, okay; if you think of the planning that you need to do, there's no room for error here, and this is kind of the point I'm trying to
make with those examples: this is the key difference between, you know, what is sometimes called Type 1 AI, the kind of thing that you have in a question-answering system, you know, Siri or Alexa, all that kind of thing, where first of all you can make some mistakes; it's okay, nobody dies, okay. You also don't have strict time and computation constraints, unlike the examples that I showed there, where an output needs to be generated within some number of milliseconds, the decision has to be taken within some number of milliseconds, and you cannot make a mistake. So those are the kinds of additional constraints that we have to take into account when migrating those computer vision techniques to those kinds of systems, okay.
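To make the "benchmark accuracy is not enough" point concrete, here is a quick back-of-the-envelope calculation; this is my own illustration, with assumed numbers (a 10 Hz perception module, independent per-frame errors), not figures from the talk:

```python
# Illustration (assumed numbers): why high per-frame benchmark accuracy
# is still not enough for a continuously operating robot.

def expected_failures(per_frame_accuracy, fps, seconds):
    """Expected number of failed frames over a run."""
    frames = fps * seconds
    return frames * (1.0 - per_frame_accuracy)

def prob_error_free(per_frame_accuracy, fps, seconds):
    """Probability that every single frame is processed correctly,
    assuming independent errors."""
    return per_frame_accuracy ** (fps * seconds)

# A 99%-accurate perception module at 10 Hz over one minute:
print(expected_failures(0.99, 10, 60))  # about 6 expected failures per minute
print(prob_error_free(0.99, 10, 60))    # roughly 0.002: almost surely fails somewhere
```

Even a 99-percent-accurate module, run continuously, is expected to fail several times a minute, which is exactly why per-frame accuracy alone does not transfer to autonomous systems.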
Another way to say it is: if you look at your typical vision paper, you know, you have 95 percent performance on whatever data set, or 99 percent; well, for those systems 99 percent is not good enough. You need to have basically continuous performance on every single image that you acquire, every single bit of sensor data. So those are the kinds of problems that we need to look at to be able to use those techniques effectively. And by the way, if you're interested, there are a couple of sources here, you know, some special issues of robotics journals on long-term autonomy and robust autonomy and things like this, that analyze in more detail the kinds of requirements that we have for those systems. So what I'm
going to do is look at basically three or four classes of issues that we need to address in vision to make it work in those kinds of systems, and to do that I'm going to use this diagram, which is the simplest possible view of an autonomous system, okay. Actually you could have one that's even simpler, with only two boxes, but we're going to start with four boxes, okay. So
we basically have some sensory input; for this discussion it doesn't really matter if it's images or videos or 3D data, you have some sensory input continuously. You have some perception box here that outputs some form of output, some interpretation of the environment, and from that you want to make a decision; in the examples I showed previously, that's for example planning for the helicopter or for the truck and so forth, okay. So the first problem that I'm going
to look at is this arrow here. The way I've drawn this, it looks like this perception box is always going to give an answer, right? You get an input, it gives an answer, you make your decision based on this. The problem with that is that this assumes that the perception box is always able to give an answer, that the input is always going to fall within the universe of inputs that this perception box can actually process. That is a huge assumption, because it assumes that the system is always going to be in conditions that it has been trained for. So the first set of problems that we look at, and I will show you one example of an approach for this, is to ask the question: should the perception system always give an answer, number one; and number two, can we anticipate when the perception system should not give an answer, or at least when the sensor is not reliable?
So let me... and this comes under various names: introspection, self-evaluation, failure detection, things like this. In fact, there is an entire new program exactly on this topic of introspection and self-evaluation, so this is an important area. So
let me illustrate that with one example. This is a typical vision task, semantic segmentation; this is what you would do in self-driving. This is a video from a car and the output interpretation here. Big progress there: things work pretty well now with this. You now put this on a robot, and then things can go pretty badly, okay. They can go pretty badly because of poor illumination conditions, things that are too close or too far, any number of conditions, and the
reason of course is that, it's impossible
to anticipate all the conditions under
which the system is going to operate and
it's impossible to anticipate basically
all those things that we.. that we go on,
in fact, we did this we provided our code
for semantic segmentation to people from
the Army Research Lab, with whom we're
with which we were working and they
basically Twitter our code and evaluated
on the robot and then produce this 80
page report on everything that went
wrong, it's... it's a really humbling
experience, you know and of course a lot
of things go, so what you really want to
do is to be able to look at the input and automatically figure out whether I should really use this input, or the output of the perception system, to make a decision. A simple analogy, and it's only an analogy: if you're driving and you're suddenly surrounded by fog, you know instantly that your vision system is impaired, and you know instantly that you cannot make decisions the way you were making decisions before. So you're going to automatically switch to a different mode of driving, instantly. So the question is: can we find approaches that allow us to do this automatically
from the input? So this is kind of related to the concept of introspection: basically, the ability to evaluate the input to a vision algorithm and to assess whether this input is going to yield enough performance to be able to make decisions and have good performance of the overall system. So that's what we're trying to do. Now, one way to do
this would be to explicitly anticipate those failure cases, basically explicitly try to encode the conditions on the input that are going to lead to bad performance. But of course that's not going to work, because you cannot, by definition, anticipate all those conditions. The second way of doing this is to learn features from the input that are predictive of performance, okay. If we can do that, then given the input we can have some prediction of the overall system performance, and we can then decide whether or not we're going to make those decisions. So this is what
we do: you basically have a lot of examples of the system in operation, and you try to have a predictor that will take the input and predict the overall performance; from that you can decide whether or not to use the input, for example, in my example of the fog, whether or not your visual system is good enough to drive. This is an example, a non-robotics example: this is an image segmentation task, and what the trained system here predicts is that the images on the left are hard to segment, in other words your vision algorithm is not going to do very well, and the images on the right are easy to segment. Now of course you look at this and you go, well, obviously those are harder than those; the important thing is to be able to do this automatically, using a systematic approach. This is another example, closer to what I showed earlier in the semantic segmentation case, where again those are hard images, okay, where the performance is going to be low, and good images here. So if we can do this, we can basically anticipate when failures of the vision system are going to occur, and we can make decisions based on this, okay.
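The performance-prediction idea described here can be sketched very simply. This is a minimal toy illustration of the principle only: the two input features, the linear predictor, and the decision threshold are all my assumptions; the actual work learns such predictors with neural networks from real runs of the system:

```python
import numpy as np

# Toy sketch of introspection as performance prediction: learn a regressor
# from input features (here two hypothetical ones, e.g. contrast and
# sharpness) to the measured downstream performance, then gate decisions
# on the predicted performance.
rng = np.random.default_rng(0)

# Synthetic training data: feature vectors for 500 inputs, plus the
# performance the full system achieved on each input (collected offline).
X = rng.uniform(0, 1, size=(500, 2))            # [contrast, sharpness]
perf = 0.4 * X[:, 0] + 0.6 * X[:, 1] + 0.05 * rng.normal(size=500)

# Fit a linear performance predictor by least squares.
A = np.hstack([X, np.ones((500, 1))])
w, *_ = np.linalg.lstsq(A, perf, rcond=None)

def predicted_performance(features):
    return features @ w[:2] + w[2]

def should_use_output(features, threshold=0.5):
    """Gate the perception output: only act on it when the predicted
    overall system performance exceeds a chosen threshold."""
    return predicted_performance(features) >= threshold

print(should_use_output(np.array([0.9, 0.9])))  # high-quality input -> True
print(should_use_output(np.array([0.1, 0.1])))  # degraded input -> False
```

Raising the threshold means acting on fewer inputs but making more reliable decisions, which is exactly the trade-off discussed next.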
By the way, this is a very old concept; this goes back to the early days of pattern recognition in radars and things like this, where people used, in those days, a term that is not used very much anymore, but should be: something called the declaration rate. So you're familiar with the detection rate and the false positive rate, if you have a classifier or any type of, you know, system; that's typically the performance curve, you know, precision-recall or ROC curves, with the two axes there. But you can have a third axis, which is what is called the declaration rate, and the declaration rate means the percentage of time that you are actually going to produce an output, okay. The
declaration rate of a hundred percent means that you always use the output, and you always assume that it performs reasonably. A very low declaration rate means that you're going to reject most of the inputs; then of course the accuracy is very high, because you've been very conservative. In fact, your vision system is going to work really well if you never use the output: it's never wrong, okay. And of course the performance is worst when you use all of the inputs, because sometimes it's going to fail catastrophically, okay. So basically the
idea is that if you can do this introspection, meaning this prediction of performance, you can add the third axis and you can basically set that threshold on this declaration rate axis, okay. If you're very conservative, you're going to be on this side; if you're willing to take more risk, you're going to be on that side, okay. So let me give you an example here, this is an example from a drone flying, okay (this is what I tell myself every morning when I get up, you know). And so basically what we're
trying to do is look at the... so this is an example here of a vision system: this is the input video (this is monocular, by the way) and this is the output, an estimated depth map. There is nothing special about the vision system that is used here; it's pretty standard. Oh no, this is another one here: the system in action. So this is the input video here, the estimated depth map, and this is a 3D view of the same thing. The way this works is that it constructs a local 3D representation of the environment and evaluates trajectories to fly through. Now the problem of course is
that if the vision goes bad here, if this estimate of the 3D environment is bad, then we can have a catastrophic failure of the trajectory estimation and a catastrophic failure of the system. So what we do is to take that input video, learn from training data how to predict performance from the input data, and use that predicted performance to decide whether or not we're going to use that input, okay. A little bit like in my fog example: if I sense that my vision system is impaired, I'm going to slow down, I'm going to do something different. That's basically the idea: anticipating failures of the
vision system. It's not too important how all this is learned; this is using some neural networks, as usual. The important thing is that we can learn this directly. Now, there is the question of how you train such a system, right, and the way this is trained in this case is to have a ground truth of the 3D environment; as a detail, this is done using stereo cameras on the drone at training time. Since we have ground truth, we know which trajectories are clear and which are obstructed; so we have ground truth on the trajectories, and we can then measure the performance of the system, meaning compare the actual trajectory evaluation with the one that I get by using my vision system, okay. If it's the same, then it's good performance; if it's not the same, if it's very different, then it's bad performance, okay. That's the ground truth, if you will, for the training system, and from that we can learn how to do this prediction.
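The label-generation step just described can be sketched as follows. This is an assumed toy representation (a 1-D array of depth samples and trajectories as index arrays), not the actual pipeline: each candidate trajectory is evaluated once against ground-truth depth and once against the monocular estimate, and a frame is labeled a perception failure when the two evaluations disagree too much:

```python
import numpy as np

def evaluate_trajectories(depth, trajectories, clearance=1.0):
    """Boolean 'clear' flag per trajectory: every depth sample along the
    trajectory must be farther away than the clearance distance."""
    return np.array([np.all(depth[traj] > clearance) for traj in trajectories])

def failure_label(gt_depth, est_depth, trajectories, max_disagreement=0.2):
    """True when the vision system's trajectory evaluation is unreliable,
    i.e. it disagrees with the ground-truth evaluation on too many
    candidate trajectories."""
    gt_clear = evaluate_trajectories(gt_depth, trajectories)
    est_clear = evaluate_trajectories(est_depth, trajectories)
    return np.mean(gt_clear != est_clear) > max_disagreement

# Toy usage: depth at 4 sampled points, two candidate trajectories given
# as index arrays into those samples.
gt = np.array([5.0, 5.0, 0.5, 5.0])    # ground truth: obstacle at index 2
est = np.array([5.0, 5.0, 5.0, 5.0])   # monocular estimate missed it
trajs = [np.array([0, 1]), np.array([2, 3])]
print(failure_label(gt, est, trajs))   # True: the estimate misses the obstacle
```

These automatically generated labels are what the introspection predictor is then trained against; no human annotation of "failure" is needed.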
Okay, so we can look at how well we predict those failures of the system, and we can then plot, basically, this detection rate; we can plot, as an ROC curve, how well we detect those failures, and this is the example here: false positive rate and true positive rate on detecting those failures. The red curve here is what would happen if we were to use just the confidence of the vision system itself, okay; any vision system is going to give you a confidence, an internal confidence in its own output, and you could use that to do this prediction. But it turns out we do much better by doing this introspection, which is not surprising, because the confidence of the vision system relies on the fact that the input is a viable input. So we can learn a valuable representation, if you will, of the performance prediction. I'll
show you this example here; let's see if this video... yeah, this is going to work here. So this is again the example of the thing flying here. What you have on the right is a scale that is going to show how confident it is in the output; when the scale is high, it means high probability of failure, which is what happens right here. What it does at that point is to enter a different mode of flying: it basically stops and looks around until it gets enough data to have enough confidence in its output, okay.
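The stop-and-look recovery behavior just described can be sketched as a gated perception-action loop. All of the function names here (`perceive`, `predict_failure`, `act`, `hover_and_scan`) are hypothetical stand-ins for illustration, not the system's actual API:

```python
# One step of a perception-action loop with an introspection gate:
# when the predicted probability of perception failure is too high,
# do not act on the output; gather better data instead.

def control_step(frame, perceive, predict_failure, act, hover_and_scan,
                 failure_threshold=0.7):
    if predict_failure(frame) > failure_threshold:
        # High predicted failure probability: enter recovery mode.
        hover_and_scan()          # stop and look around for better data
        return "recovering"
    act(perceive(frame))          # output deemed reliable: use it to fly
    return "flying"

# Toy demo with stub functions standing in for the real components:
print(control_step("frame",
                   perceive=lambda f: "depth-map",
                   predict_failure=lambda f: 0.9,
                   act=lambda out: None,
                   hover_and_scan=lambda: None))
# -> recovering
```

The design point is that the gate sits between perception and action, so the planner never consumes an output that the introspection model has flagged as unreliable.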
So this implements that behavior that I was describing earlier: detecting, from self-introspection, when the system itself is going to fail, or may fail, because of the type of input. If we measure the performance (I don't know what this thing here is, but okay), if we measure the performance by seeing how far we can fly without manual intervention, the green bar here indicates flying with introspection, in other words having this failure detection and being able to recover from it, and the red bar here indicates flying without the introspection, and not surprisingly we find that, you know, you can go much further and have a much more robust execution. We can look at what it found:
if you go back and see what failures it detects, it detects some of the things that you would expect: things like, you know, illumination issues, things like rotational motion, which in this case affects the quality of the 3D reconstruction, etc. So of course some of those things are not surprising; those are the kinds of failure modes that you would expect. The important thing is that it's able to learn how to detect those things automatically, using, you know, a unified procedure to do this. All right,
let's keep going... this is another example of the same thing; this shows only the depth map here. So here we have basically good performance of the vision system, and you're going to see a case here where things get very, very sparse and you have a high probability of failure of the system; it's going to enter again a recovery mode, looking around until it gets good data again, and then it flies again, okay. So that's basically the idea of introspection... yes? Yeah. [Answering audience question:] So let's be very careful here: what I'm
talking about is ideas on how to deal with those failures of the vision system; what I have not addressed is what do you do with that information, you know, what strategy do you use once you have that information. And so I don't have an answer to that; that part is still very specific to the system at hand, okay.
[Audience question:] So the standards on aircraft are a 10^-9 error rate, so what would this get us to?
[Martial Hebert:] Oh! So again, I don't have the answer in this context, but in the case of the drone, we basically were able, for example in those experiments, to reduce the error rate by a factor of 10 or something like that. It's very dependent, of course, on the system that goes around it and the way that shows; it's more about looking at the principle of being able to learn to characterize the performance automatically. So one important thing
that I mentioned in presenting this is how it is trained, and I mentioned this idea of measuring, you know, which trajectories are correct and incorrect and all this, meaning that it's trained on the behavior of the overall system; it's not trained on the output of the visual system. And that brings up an important point, which is a basic problem that I think is pretty much still open, which is, you know, the choice of the right error metric, or loss function, that we use in building those visual systems.
So let me illustrate this on an example. This is an example here from a semantic labeling task: you have an image and you label pixels according to different types of regions. Those are two different outputs, and it's clear to everybody that the output on the top is better than the output on the bottom, because of that big green mistake here, that mislabel. But if you think about it, why do we say that it's better? We say that it's better because the error metric, the loss function if you will, that we use is the percentage of pixels mislabeled, right. When we train such a task, we try to maximize the number of correctly labeled pixels, and in fact the objective here, implicitly, is to label every pixel correctly. The problem with that is that, from a robotics or autonomy point of view, there is really no application where you need to label every pixel correctly; it just doesn't exist, okay. Similarly,
when you think of the flying application that I showed earlier, you could measure the performance of the system by looking at the accuracy of the 3D representation, but there again you don't care if the point cloud is, you know, five millimeters accurate or two millimeters accurate; the only thing you care about is that your trajectories are correctly evaluated. That's it, that's the only thing you care about. So the basic problem is that in many cases we are using the wrong error metric or objective function in the vision box that we have as part of the system, for example the number of pixels labeled, or the accuracy of a 3D point cloud, and we're using one that is actually harder than the actual task that we're trying to solve. If I try to label every pixel correctly, I'm actually solving a task that is harder than what I need to; the only thing I need to solve is the robotic task. So this is a
basic issue that we have in trying to implement those things and trying to really use vision effectively in those tasks. In some cases it is possible; I showed an example here in flying, and there are many other examples, in manipulation for instance, where the right objective function, if you will, is whether I've grasped an object or not; that's a well-defined objective function, I can label things, I can measure things that way. But if you look at more complex tasks, like this task on semantic navigation for example, how do you characterize an end-to-end metric for this task? So that's a major challenge, and that's the idea of being able to use, you know, the right error metric, okay. So that was the
idea of introspection and understanding the performance of the overall system. Let me look at a second problem, which is the problem of limited supervision, and the idea here is this: if you look again at my little system here, this perception box has to be trained from somewhere, from some data, and so the addition to this diagram is that we have lots of data here that we're going to use for training this perception box. The problem of course is that that data is generally supervised data, labeled data; it also has to be very large data. This is a major limitation in being able to use those systems, simply because in the kind of online applications that I showed earlier, we don't have the luxury of being able to acquire very large amounts of annotated, supervised data. We also don't have the luxury of having the time and computation to retrain every time we want to introduce a new concept, every time we change the operating conditions. So the second area that we look at is this idea of reducing supervision, okay, and being able to learn those visual models with as little data as possible, okay. So again, the idea
currently is that we train those visual models using a very large amount of labeled data; what we want to do is to be able to do this from very few examples, or maybe even just one example, okay. And the reason why that should be possible, of course, is that even though we have only one example here, we might have a lot of experience offline, with many, many training examples and many, many different visual tasks. So the idea here is to design those vision systems so that, instead of just being trained from, you know, one labeled set of data, they use the prior experience to learn those models, okay. So imagine that you have had to train your system for many, many different tasks, right; you now have a new task, a new class that you want to deal with, a new type of object or something; you want to use all that prior experience to be able to train quickly on that new task, with very few examples, okay. And if we can do that, we can reduce the amount of supervision that is needed and have a more practical system. Now this falls under various headings, you know, meta-learning, learning to learn, things like this, which basically means, again, trying to use the prior experience of having learned many models to now learn quickly a new model. So let me tell you some of the things
that we've been looking at in this space, and I'll talk a little bit about one aspect of that, which I call model dynamics, which is the idea of reasoning not just in feature space or in image space, as is often the case in learning, but in the model space. So let me explain what I mean here. Let's say that we have some kind of classification task, a simple image classification task, okay; let's say, you know, living-room images versus non-living-room images, from one of those data sets. One way to think about it, in data space, is to think of each image as corresponding to some feature vector in some very high-dimensional space, maybe, and those features are constrained to lie on some, you know, subset of that feature space, that might look something like that; and then we have the other class here, which is also features in that feature space, in that data space, somewhere else, and the classification problem can be visualized as, you know, learning some kind of boundary between the two things here. This is basically what one does in a machine learning setting, typically. Now the dual way of thinking about this
is like this: you can think of this task, classifying living room versus non-living room, as generating some kind of classifier, which I illustrate by this blue symbol here, and this classifier is basically a big vector, you know; you can think of it as living in a high-dimensional space as well, and it's also constrained, because, you know, visual tasks have a certain structure; those classifiers are implicitly constrained to be on some subspace, some subset, of that high-dimensional space. And if you take another task, say classifying another type of class, then you have another model, somewhere else in that high-dimensional space, okay. So the idea here is to understand the relationship between those models, with the idea that if I have trained many, many models like this, this is my past experience, then I can have an idea of the structure of that space of models, and given a new task, maybe I can use that knowledge to learn quickly on that new task, okay. So
that's basically the idea, okay. So let me show you an example here of using that idea. [Answering audience question:] Yeah... it does not have to be similar, it can be different; in fact it's better if it's not, because if you have lots of those tasks, conceptually (and I'm not saying anything mathematically, okay), conceptually you're sampling that set of tasks, right. So it's the dual of the idea of sampling, you know, the feature space: instead of sampling densely there, you're sampling in model space. So let me give you an example of what we can do with this kind of idea.
Suppose that I train a classifier for this category with only one example. So with only one example I'm going to get a classifier (for some reason on the screen this is very dark): the classifier is this line here, okay. So I train this classifier, this line here, and it's not going to be a very good one, right, because you have only one example that you used to do this. Then if you put in more examples, your classifier is going to be a little bit better, and then with more examples it's going to be better and better, okay. It's a little bit hard to see, I'm sorry about this, but it's right here: it's moving closer, this way, to the boundary, okay, and so forth, okay, until you have really a lot of examples, and now you have the final, you know, what you would call the best classifier, all right. So the question now
is: could you somehow say something about how we went from that guy to this guy? Basically, some kind of transformation here that goes from the small-sample classifier to the large-sample classifier, okay. And the idea is, if you had enough examples of this kind of situation, in other words enough examples where you have the classifier for one sample and the classifier with lots of samples, which is somewhere else, then maybe you can learn how to figure out the transformation between them, okay.
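As a toy sketch of this regression-between-models idea (my own construction for illustration: the talk's version uses a deep network, where this uses a simple linear map and least-squares classifiers):

```python
import numpy as np

# From many past tasks, collect pairs (classifier trained on few examples,
# classifier trained on many examples), then learn the transformation
# between the two in model space.
rng = np.random.default_rng(1)

def fit_classifier(X, y):
    """Least-squares linear classifier weights [w1, w2, bias] (toy stand-in)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

pairs = []
for _ in range(200):                              # past experience: 200 tasks
    true_w = rng.normal(size=3)
    X = rng.normal(size=(100, 2))
    y = np.sign(X @ true_w[:2] + true_w[2])
    pairs.append((fit_classifier(X[:2], y[:2]),   # small-sample model
                  fit_classifier(X, y)))          # large-sample model

W_small = np.array([p[0] for p in pairs])
W_large = np.array([p[1] for p in pairs])
# Regress the small-sample -> large-sample transformation over the pairs.
T, *_ = np.linalg.lstsq(W_small, W_large, rcond=None)

def improve(w_small):
    """Predict what the classifier would become with many more examples."""
    return w_small @ T
```

Given a new task with very few labeled examples, you would fit a classifier on those examples and pass its weights through `improve`. As argued in the talk, such a map can only help because models of visual tasks occupy a constrained subset of weight space rather than varying arbitrarily.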
So that's basically the idea, right: we essentially regress what the classifier would have been with a large sample from the one that we have from the small sample. Now, the idea behind this, of course, is that there would be no reason for such a thing to be useful, or even to exist, except for the fact that the underlying assumption is that the visual world is in fact highly constrained, and the corresponding classifiers, the corresponding models, are also highly constrained; they don't change in completely arbitrary ways. That's basically the idea here. So, how is this done in practice: this is done using a network that takes as input the model learned with few examples and produces as output the model that would have been learned with a lot of examples, okay. And of course it should have the
behavior that the network is simpler.. if
you have more examples, in fact if you
have a lot of examples
or the network should do nothing, it
should be the identity and as you have
fewer and fewer examples, it's going to
you're going to need a deeper and deeper
network to generate that regression ok.
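The regression idea can be sketched in a toy setting. The sketch below is illustrative only, not the actual system from the talk: it assumes, per the constrained-world assumption, that true models live in a low-dimensional subspace, replaces the deep regression network with a plain linear least-squares map, and uses made-up dimensions and sample counts.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 2                                  # ambient dim; "constrained world" subspace dim
U = np.linalg.qr(rng.normal(size=(d, k)))[0]  # hypothetical low-dim model subspace

def ridge(X, y, reg=1e-2):
    """Linear classifier weights via ridge regression on +/-1 labels."""
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)

def make_task():
    w = U @ rng.normal(size=k)                # true models are constrained
    X = rng.normal(size=(500, d))
    y = np.sign(X @ w + 0.3 * rng.normal(size=500))
    return X, y

# Data-rich categories: collect (small-sample model, large-sample model) pairs.
A, B = [], []
for _ in range(300):
    X, y = make_task()
    A.append(ridge(X[:2], y[:2]))             # model learned from 2 examples
    B.append(ridge(X, y))                     # model learned from all 500
A, B = np.stack(A), np.stack(B)

# The "network" regressing large-sample models from small-sample ones --
# here just a linear least-squares map, the simplest possible stand-in.
T, *_ = np.linalg.lstsq(A, B, rcond=None)

# Rare categories: only 2 examples each.  Apply T and compare accuracies.
acc_few, acc_reg = [], []
for _ in range(20):
    X, y = make_task()
    w_few = ridge(X[:2], y[:2])
    acc_few.append(np.mean(np.sign(X @ w_few) == y))
    acc_reg.append(np.mean(np.sign(X @ (w_few @ T)) == y))
print(f"2-shot: {np.mean(acc_few):.2f}   regressed: {np.mean(acc_reg):.2f}")
```

Because the small-sample models' noise lies mostly outside the shared subspace, the learned map effectively denoises them toward plausible large-sample models, which is exactly the "highly constrained models" intuition from the talk.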
So that's the idea of what we look at here. Now, the interesting thing about this is that it allows us to implement what should be a very natural behavior in learning. Imagine that you have a lot of tasks in some region of that space, so you have lots of categories; in this example there are 400 from one of those datasets. Say you have 400 categories: some of those categories are going to have lots of examples, and some are going to have very few. This plot shows those 400 categories ordered by the number of examples in each category, and this is what happens in real life: you have a few things that are very common, for which it is very easy to find training examples, and then many things that are very rare, with very few examples. So the behavior that
you would like to see is to use those categories here to help train on these categories: basically, to use the stuff that you can get data on very easily to help train on the stuff for which you have very little data. That's what we can do here, because we can learn this transformation using the stuff for which we have lots of examples and then use it on the stuff for which we have very few. This is a
view of what happens here. Again I have my four hundred categories; you can think of them as four hundred classification tasks, ordered according to the number of examples in each category, with the most examples on the left and the fewest on the right; at the extreme right, the categories have only one example. What the red bars show is the increase in performance that we get by using this technique, basically by training this transformation on the categories with lots of examples. You see that it has the desirable behavior: it doesn't change anything for the common categories, where we already have a lot of data, and it increases the performance dramatically on those that have very few examples. So this implements the natural idea of having the stuff that's easy help the stuff that is hard. All right, so that's one idea. The other
idea that we've worked on very recently,
especially, is the idea of hallucinating data. This is now a very common idea in learning: using our initial training data to create new data, basically hallucinating new data, so that we can train a better model. Here is a quick view of the idea:
Let's say again that you have your training data with very few examples, because you don't yet have the luxury of labeling a lot of examples. What we're going to do is take those initial examples and pass them through a box that is going to generate a new sample; there are different ways of doing that nowadays, which I'm not going to detail, but it just generates a new sample. If we do that enough times, we can augment our training data with those examples. So now we have the real data, the original examples, and the new ones that we generated; we combine them and train the classifier with this combined data. The interesting thing is that, at training time, we can adjust the parameters, basically train the thing so that the system learns how to generate those examples in a way that best benefits the training.
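As a minimal sketch of this pipeline (not the actual learned hallucinator from the work, whose generator is trained jointly with the classifier): here the "box" is just a hypothetical Gaussian-jitter generator, applied to a 2-D toy problem with three labeled examples per class.

```python
import numpy as np

rng = np.random.default_rng(1)

def hallucinate(seeds, n_new, scale=0.5):
    """Toy stand-in for the hallucination box: perturb the few real seed
    examples to synthesize new samples.  (In the actual system this box is
    learned, and tuned at training time so that the generated samples best
    benefit the classifier.)"""
    idx = rng.integers(0, len(seeds), size=n_new)
    return seeds[idx] + scale * rng.normal(size=(n_new, seeds.shape[1]))

# Two classes in 2-D, but only 3 labeled examples each.
mu0, mu1 = np.array([-1.5, 0.0]), np.array([1.5, 0.0])
X_few = np.vstack([mu0 + rng.normal(size=(3, 2)), mu1 + rng.normal(size=(3, 2))])
y_few = np.array([0] * 3 + [1] * 3)

# Augment the real data with hallucinated samples, then train on the mix.
X_aug = np.vstack([X_few] + [hallucinate(X_few[y_few == c], 50) for c in (0, 1)])
y_aug = np.concatenate([y_few, np.zeros(50, int), np.ones(50, int)])

def nearest_mean_classifier(X, y):
    means = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    return lambda Z: np.argmin(((Z[:, None, :] - means) ** 2).sum(-1), axis=1)

clf = nearest_mean_classifier(X_aug, y_aug)

# Evaluate on fresh data from the true distribution.
X_test = np.vstack([mu0 + rng.normal(size=(200, 2)), mu1 + rng.normal(size=(200, 2))])
y_test = np.array([0] * 200 + [1] * 200)
print("test accuracy:", np.mean(clf(X_test) == y_test))
```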
So that's basically the idea here, and it combines two ideas: the first was this idea of regressing from small samples; the second is the idea of generating new data to help in the training. We see major improvements; the orange bars and the yellow bars, our technique, compared to state-of-the-art techniques in this area on some classification tasks. I mentioned all of this in the
context of classification, which is not that useful in practical applications, so we can do the same thing for detection. These are actually new results that are not published yet, but this is also a standard ImageNet type of task, where the horizontal axis is the number of training examples, starting with one example, so that's one-shot learning. The red line indicates the technique that I just described and the black lines indicate reference techniques, Faster R-CNN and so on. The important thing here is that with very few examples, where before we had very low performance, we now get higher performance, to the level where we can basically bootstrap the training and do things like semi-supervised learning and other techniques.
Okay, the last thing that I want to talk about, because I have to be careful about the time, is one key aspect of the kind of robotic systems that we saw, which is the time aspect. To explain that, let me go back to my basic diagram. The way it's drawn here is the standard way of building those things: we have an input, I'm going to compute for a while, and then I'm going to get an output, and the way it's drawn, I have to wait until this is done,
however much time it takes, before I can use this output. The problem is, if you are that helicopter or that car or whatever the system is, you may not have the luxury to wait that long; maybe you need an answer right now. And in fact there may be situations where, for whatever reason, you have even fewer resources and even less time, and you need an answer right now; maybe you don't even care if the answer is not quite the best you could get, but you need some indication right now. If you think of very simple obstacle detection for a mobile robot: maybe, because of the limited computational budget I have, I need an answer right now to be able to plan my motion. I don't care about the precise location of the thing; if I know that there may be something on the right and nothing on the left, that's good enough, but I need
that information. So that means we need to think of the perception system in a different way. Instead of thinking of a perception module as something that takes an input and produces an output, whatever time it takes, we need to think of the perception system as being able to produce some usable output no matter what computational budget, and no matter when the overall system needs that answer. So we need basically a perception system that gives us an output at any time, and in fact there's a technical term for that in computer science; I have a reference here from '96. There is a concept in computer science of an anytime algorithm, and an anytime algorithm is an algorithm such that, if you interrupt the execution at any time, you get the best possible output that you could get within that time.
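The classical Newton–Schulz iteration is a concrete instance of this idea for matrix inversion: each step refines an approximate inverse, so interrupting after any number of steps returns the best approximation available so far. (This specific algorithm is my illustration; the talk does not name one.)

```python
import numpy as np

def anytime_inverse(A, budget_iters):
    """Newton-Schulz iteration for A^{-1}: interruptible after any step."""
    # Safe starting point: X0 = A^T / (||A||_1 * ||A||_inf) guarantees convergence.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(A.shape[0])
    for _ in range(budget_iters):
        X = X @ (2 * I - A @ X)   # each step squares the residual I - X @ A
    return X

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)) + 5 * np.eye(5)   # a well-conditioned test matrix
for k in (1, 5, 20):                          # three different time budgets
    err = np.linalg.norm(anytime_inverse(A, k) @ A - np.eye(5))
    print(f"{k:2d} iterations: residual {err:.2e}")
```

The residual decreases monotonically with the iteration budget, matching the monotonicity property of anytime algorithms discussed next.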
So for example, if you want to invert a matrix, instead of designing and implementing the algorithm so that you have to wait until the entire matrix inverse is computed, you can design the algorithm such that, if you interrupt it at any time, you get the best approximation of the inverse of that matrix that you could get within that time. Now that's a
different way of thinking about how to design those algorithms. There is a more formal definition of this; this is a reference from a paper in '96 that defines more formally what that means. We want the algorithm to be interruptible, meaning that it can give an answer at any time. We want monotonicity, meaning that the answers do not get worse over time; that's the minimum you want: better and better answers, better and better interpretations of the environment, over time. And the third property is diminishing returns; this is important, because you need to know how long to wait: you want the answer to improve less and less over time. If we can do that, then we have systems that are anytime, and that implement this idea of being able to provide an answer that is usable for reasoning and decision-making no matter what the computational budget is. So let
me give you a quick example. This is an example in terms of semantic segmentation. In an ideal world, what you would like is: if you have very little time, your interpretation of the scene might look something like this, a very coarse interpretation; if you have a lot more time, it's going to look like that. That's the basic idea. So let me show you some of the things that we've been doing in that space, just to give you a rough idea of what
we're trying to do here. One way to look at it is using, again, a neural-network kind of architecture; this is a completely high-level, idealized view of the system. You have the different layers, different levels, in that network. The typical approach would look at the output of the last level and try to optimize with respect to a loss from the last level. Instead of that, or in addition to that, we can generate intermediate losses at different levels of this architecture and then try to optimize all of those losses together, so that if I interrupt the system at any stage, I get a sensible output, a useful output. Now the problem is how to
optimize those things together. One way to do that is to take all those losses and combine them, for example in a linear combination, and train the system on this combined loss instead of the loss on the last layer alone, with the idea that we are then not just trying to get the best possible output at the end; we are also trying to get sensible outputs at intermediate stages of the network. The key challenge, the key difficulty, is how to choose those weights, and there has been of course prior work on this. The simple thing is to use constant weights, and there is prior work on doing this; you could also think of using linear weights, and there is work from two years ago on that. It turns out, and we can show this with some theoretical results we have as well as empirical results that I'm going to show, that a good way of choosing those weights is to adjust them dynamically such that the weight is inversely proportional to the loss at that layer. Another way to say that is that we're going to give more weight to the layers that have a lower loss, and it turns out that you get good behavior of the system that way.
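One simple reading of this weighting rule (the talk does not give the exact formula, so the normalization here is my assumption) makes each weight proportional to the reciprocal of its layer's current loss:

```python
import numpy as np

def adaptive_weights(layer_losses, eps=1e-8):
    """Weights inversely proportional to each intermediate loss, normalized
    to sum to 1: low-loss (well-trained) heads get HIGH weight, so their
    performance is preserved rather than neglected in later training."""
    inv = 1.0 / (np.asarray(layer_losses, dtype=float) + eps)
    return inv / inv.sum()

def combined_loss(layer_losses):
    """Weighted linear combination of the per-level losses."""
    return float(np.dot(adaptive_weights(layer_losses), layer_losses))

# Four heads, shallow to deep, with their current losses.
losses = [0.9, 0.5, 0.2, 0.1]
print(adaptive_weights(losses))   # deepest (lowest-loss) head gets most weight
print(combined_loss(losses))
```

A side effect of this particular normalization is that each head contributes the same amount to the combined loss value, while the gradient of each head is still scaled by its weight; other normalizations are possible, and the theoretical analysis mentioned in the talk would pin down the right one.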
So again, the main point here is this basic idea of anytime prediction as an important component of your system, being able to make decisions with an arbitrary budget of resources; the second point is this kind of structure, with weighting, and this strategy to set the weights. [In response to a question:] That depends a little bit on the application, but, say, in semantic segmentation, at those intermediate levels you can reconstruct an output labeling and compute the error on that. It has to be adapted a little bit to each application, depending on the actual format of the output, but in most cases you can do it. So this is
an example on a classification task. For a classification task it's even easier, because you can just use the right output. Here is what this shows: the vertical axis shows the loss of performance at each of those levels; "building blocks" means those levels, from 1 to 14 in this particular example. Loss of performance means that in an ideal world it should be 0; in other words, at each level I should get the best possible performance that I can possibly get with that number of levels. That's the reference here. This is what you get with constant weights, and with constant weights you have this really annoying thing: we are pretty close to the optimal, except for the final output, meaning that you can interrupt and get some sensible output, except that the overall performance of the network goes down dramatically, which is not what you want; you still want, of course, the final output to be correct. This one is using another strategy, and the right one is using the strategy I described.
The behavior that you want is for this red line to be as close as possible to zero, so that you don't lose performance. It's never going to be at zero; that's impossible, because zero means an oracle where we've trained the system completely with two layers, with four layers, and so forth. You cannot reach that; you want to be as close to it as possible, and you want to maintain the final performance, so basically you want something flat that's close to zero. That shows you the behavior there. We've used this with various classification tasks, and
again for segmentation tasks. To answer your question, this is how it looks for semantic segmentation: you get intermediate results that are refined as you go further into the network. Hopefully that implements the kind of thing I was describing: if you interrupt at any time, you have a coarse output, which has a higher error but is still a usable output if you need to make that decision quickly. Let me skip some of the details here. So this shows the evolution of the output throughout the network, going from a coarse, noisy output to the final refined output. All right, so
what I showed are some ideas on doing introspection, or input filtering: basically the visual system being able to understand its own performance and adjust its behavior. I talked a little bit about the problem of designing the error metrics, or the objective function, such that they are tuned to the overall task. One thing I did not talk about is the idea of multiple hypotheses: instead of generating a single output from the visual system, a single interpretation, for example a single semantic labeling of the scene, you generate multiple hypotheses. We don't have time to talk about that, but it is also important, because if you don't do it, you assume that whatever interpretation is generated by the vision system is the only one that you can reason on, and you cannot go back in the reasoning chain. I talked about anytime prediction, and about this idea of small-sample training, small-sample learning to reduce the amount of supervision. So those are a few of the topics that we look at in vision in the context of robotics and autonomy, and I think I should stop here because it's right about on time. Thank you.
[Audience question :] Thanks for the presentation. I have a question about anytime prediction. As I understand it, basically you want, at every stage, the L1 or the Li to be a reasonable prediction, but the way you optimize the loss, with that weighting, if the loss is not good, the weight goes down. So is that loss basically not motivating...?
[Marshall :] There is an important point there; yes, it's counterintuitive, basically. You would expect the weighting to be the other way around. The reason it is that way is that if you have one level that works very well, its loss is very small, and if, because the loss is very small, you give it a very low weight, then it's going to get neglected, in a way, in the later stages of training, and what's going to happen is that this loss is going to go up again. So you need to make sure that if a level has a very low loss, has reached a good level of training, you heighten its weight so that it gets preserved. That's the larger intuition behind this.
[Anna :] There is a discussion in the field about whether we should invest more time and resources to develop end-to-end learning systems, or whether, for autonomous systems, it's better to keep them modular. One of the key arguments behind modular is that you can control each part of the system; it's potentially more interpretable. So what would be your take?
[Marshall :] Oh well, yeah, I'm not sure, because as I said it's still an open problem, but in the introspection part, the interesting thing going on there is that the system is designed in a modular fashion; the visual system is trained separately and designed separately, like you said, but the performance-prediction part is end to end. So you could actually have both: you have the system trained in a modular fashion, but its performance prediction and performance control, and the overall control of the system, are done end to end, and I think, from what we've done, that's a good compromise. Because with training fully end-to-end, the problem is of course that you completely lose control over what the system does, and any explainability of the system at all; doing it purely modular, you have the problem that I explained, that you have a mismatch between the individual modules and your overall task. So doing it modular, and then having the overall performance characterizable and predictable, is still a good compromise.
[Claps]
