TOMASO POGGIO: I'm
Tomaso [INAUDIBLE]
I am welcoming all
of you on behalf
of the Center for Brains,
Minds, and Machines.
Which is a center at
MIT with other partners
across the country--
the main one being Harvard.
It is one of 14 large NSF
centers in all areas of science
and technology.
This particular one is about
the science and the engineering
of intelligence.
We had today our external
advisory committee meeting.
This is a great set
of people and friends
and leaders with
great wisdom that
gave us, and continue to give us over the years, very useful advice.
And as some of you may remember
last year for our meeting
we heard from Demis
Hassabis, also
a member of our advisory
committee, about AlphaGo.
That was the first talk
he gave after AlphaGo
winning in Seoul
against Lee Sedol,
the kind of unofficial
world champion of Go.
And today we'll hear about Amnon
and Mobileye and autonomous
driving which I think
is the top achievement
of modern deep
learning and AI so far.
I'm glad and proud to introduce
to you Amnon Shashua, founder
of Mobileye, professor
at the Hebrew University.
I met Amnon for the first time when he arrived at MIT for his graduate studies.
This was '89?
At the AI lab.
And he was a student
of Shimon Ullman,
and eventually mine
during his PhD.
He stayed with me for a postdoc.
My main claim to fame
in terms of training him
is to get him interested
in entrepreneurship.
We went together
on a business trip
to Japan that led I guess
to a failed startup.
But this was followed by successful companies he's founded, CogniTens, Mobileye,
and he is, of course,
one of the most successful
among my students and postdocs.
And is also one of the
greatest human beings I know.
He started with a master's thesis on saliency computation in vision. In his PhD he dealt with invariants to illumination.
And then he wrote
a paper, I think
at the end of the
fellowship in my lab,
about multiple view geometry.
Which introduced a fundamental
algebraic relationship
between three views,
which is known today
as the trifocal tensor.
I don't think I understand
the mathematics even today,
but that's a side comment.
And then this found both theoretical and practical applications, ranging from 3D reconstruction to camera calibration, robotic navigation, and so on.
He got several Best Paper
awards, the Marr Prize in 2001,
and so on, and so on.
He was chairman of the
School of Engineering
and Computer Science
at Hebrew University--
2003 to 2005.
In '95 he started CogniTens.
And in 2000 he founded Mobileye.
And in 2012 another company, OrCam,
about which you may speak?
Or no?
OK.
So the latter two-- so OrCam and Mobileye--
are both partners of
CBMM, and are the ones
I'm most proud to
speak about in terms
of pioneering the
technology of intelligence.
So Mobileye is an amazing case.
I think it is the prototypical
success of machine learning
and computer vision in the
last year because of the extent
and the success
of the underlying
mixture of sophisticated theory
and impressive technological
applications.
And I want to show
you a little video,
you will see videos
from Amnon today.
This one was something we
did in my group in '95.
So that's 22 years ago.
This was a project
with Daimler-Benz.
It was one of the first applications of machine learning in computer vision.
We trained something similar
to Support Vector Machines
with about 2000
images of pedestrians.
And then we ran
this system, there
was a computer in the
trunk of a Mercedes.
This was done in
[INAUDIBLE] Germany.
And the system was able to detect pedestrians, and made some false detections as well, as you can see. There is a traffic light that gets classified as a pedestrian in one frame.
And so we had, at that
time, an error rate of one
error every three frames.
And we were very happy about it.
But this is 10 errors
per second, right?
So now I think Mobileye has
one error over, I don't know,
30,000 miles--
or something like that.
And so this would mean, rough order of magnitude, about 1 million times better accuracy. So that's doubling accuracy every year for 20 years-- about 1 million times.
That's machine learning, you
know, the progress of it.
Amnon, your turn.
[APPLAUSE]
AMNON SHASHUA: Thank you, Tommy.
I have my adapter here.
So you all received the flyer.
We'll take care of it after--
during the Q&A session
we'll talk about this flyer.
So it was supposed to
be an intimate lecture.
I'm a bit overwhelmed by what Tommy did to me.
So, you know, autonomous
driving captures everyone's
imagination.
And there are a lot
of aspects to it.
There are policy aspects,
engineering aspects.
So let's do the following--
so I'll give a lot of time
for Q&A since this is an area
that people feel very strongly
about, and would like to
ask lots of questions.
So I have say 45 minutes, 30
minutes or less, I'll talk
and then we'll have lots
of questions and answers.
And I'll focus only on
the engineering side.
Now during the Q&A I'll also
put on my executive hat.
So you can ask questions
about policy and so forth.
But now I'll focus only
on the engineering side
since we are here at MIT.
That's really what matters--
engineering.
So I'll talk about the
fundamental problems that
underlie autonomous driving.
So we're talking about
machine learning,
artificial intelligence.
I'll focus on where exactly AI is hidden in all of this mix. Because people talk about AI almost everywhere-- everything people talk about is AI. I'll be more specific about where exactly it is hidden in the equation.
I'll talk about
different approaches.
There's more than one
approach to do things.
I'll talk about
the wrong approach
and the right approach.
And guess who does
the right approach?
[LAUGHTER]
AMNON SHASHUA: OK?
So let's begin.
So in order to do
autonomous driving there
are three areas that
we need to master.
And I'm starting from
the least complicated
and then moving to
the most complicated.
So the least complicated
is all about sensing.
So sensing, we have cameras--
say 360 degrees--
we have radars,
we have laser scanners
called Lidars.
And we have high performance computing, very sophisticated silicon, that receives all this data.
And then we have
sophisticated algorithms
that interpret the data.
So interpreting the
data is building
an environmental model.
We need to know where
all the road users--
vehicles, pedestrians,
cyclists-- are.
We need to know where all the path delimiters are-- like curbs, barriers, guardrails--
where we can drive, where we
cannot drive, the free space.
We need to find, of course, the
traffic lights, traffic signs.
And most complicated
is the drivable paths.
When we look at the road we have semantic meaning for everything. We know whether a lane mark is a solid lane, a fragmented lane, or a road edge.
This lane leads
to a highway exit.
The lane and the one on the
left are going to merge.
There are pavement markings.
We take all this
information and we
understand-- we understand
the semantics underlying them.
So this is one area where
artificial intelligence
is hidden.
So this is sensing.
It is relatively
well-defined although there's
more than one approach.
And I'll focus on the differences between those approaches.
The second area
is about mapping.
So mapping is not
only technology,
it's also a logistical problem.
So I'm not talking about
the navigation maps,
I'm talking about
very precise maps.
Precise meaning you need
to localize yourself
in that map at an accuracy
of 10 centimeters.
So GPS would not give
us this accuracy.
It could give us this accuracy when we're in open areas where differential GPS is available.
But when we are
in urban settings,
you cannot get a consistent
10 centimeter accuracy.
So one thing is localization,
10 centimeter accuracy.
Second is the richness
of information.
We need to know
where all the lane
markings are, their semantic
meaning, the drivable paths.
Everything I talked about in sensing-- just remove the road users, the vehicles, the pedestrians.
What is left is the
building blocks of maps.
And these are called
high definition maps.
So it's not only a
technological issue,
how do you build these maps,
it's a logistical issue.
How do you build
them in a way that
is very low cost and scalable?
And so it's not that you want to support only one or two cities-- you know, spend a lot of effort and map Mountain View.
Big deal, right?
We want to map the entire
U.S. How do you scale up?
Second, is how do you
create a live map?
Because if this
map is going to be
critical to support
autonomous driving then
it has to be always correct.
Always correct meaning
that if something changes
in the environment
we would expect
that this change would be
reflected in the map almost
instantaneously.
So kind of near real time.
So how do you do that?
Because traditional map making is very, very time consuming, very laborious-- lots and lots of manpower needs to be invested. It is very, very costly.
So it's also a
logistical problem.
But why is it a
logistical problem?
Because-- and here I'm
putting on an executive hat--
the idea is to build an economy,
it's to build a business.
Now if the cost of
supporting autonomous driving
is more than the cost of
having a driver drive the car--
and these bloody
maps could exactly
make us reach that point--
then we will not
have an economy.
We'll have a nice
science project,
but it will not
create a new economy.
So this is a critical issue.
How do you build these maps-- how do you build them in a way that scales up?
The killer is the
driving policy.
What is driving policy?
This is where most of the
artificial intelligence
is hidden.
And this is largely
an open problem.
So if people tell you that
autonomous driving is just
around the corner,
they don't know
what they're talking about.
This is really the Achilles
heel of the entire industry.
So driving policy is
all about negotiating.
It's the reason we
take driving lessons.
We don't take driving lessons to train our senses.
We take driving
lessons because we
want to learn how to
negotiate in dense traffic.
And when you negotiate
in dense traffic--
the culture of negotiation
is really location dependent.
In Boston we drive very, very
differently than in California.
So it's location dependent.
And we negotiate.
We don't negotiate by
talking to each other,
we negotiate by motion.
Our motions signal to the
other drivers our intention.
And there could be deadlocks.
We want to do it in
a way that is safe
so we don't have accidents.
And we want to be able
to do it in a way that
mimics human behavior
because if we are the only
conservative vehicle
on the road we're
going to obstruct traffic.
And if there are thousands like us, we're going to clog the entire city.
So the robotic cars need
to drive like humans.
They need to drive like
humans, but on the other hand
they need to be safe.
They cannot drive recklessly.
So this fine line between driving like a human on the one hand and driving safely on the other is really an open problem. And I'll focus on this in a bit.
So these are the three--
I call them pillars-- the three
pillars that we need to handle.
So sensing is very,
very difficult.
But this is the
easiest among them all.
Mapping is a big
logistical problem.
And driving policy is
mostly an open problem.
So I'll show now four clips that kind of set the stage, and then I'll get a bit more technical.
So what I'm going to show here is a clip of what sensing looks like at the output level.
So we'll look under the hood.
There are eight
cameras around the car,
there are also radars
and laser scanners.
But I'm showing only the
output of the visual sensing.
So what you're going to
see when I run this clip--
these are the 3D bounding
boxes around cars.
This green carpet
signals the free space.
You're going to see also
traffic signs, traffic lights.
And you're going to see
this from multiple views.
The lane markings, also
pedestrians, traffic lights.
A few pedestrians.
And at the edges of this
free space being shown here,
there's also
semantic information.
Semantic information of whether
this is a curb, a solid lane,
fragmented lane, barrier,
guardrail, and so forth.
So this is what
sensing is about.
It tells me where all the road users are, the path delimiters, and the drivable paths. And I'll get back to sensing later and explain what is really difficult in this.
And there are three layers-- one is relatively straightforward,
the other one is
more complicated,
and the third one is
really a big issue.
So this is sensing.
Mapping.
So this is also a clip.
What I'm going to show here--
the mapping is done in
a crowd-sourced way.
So this flyer becomes relevant.
So in 2018 we're going to have two million cars from Volkswagen and BMW generating data.
So data that we generate in
our chip inside these cars.
What is this data?
The data is about harvesting
the lane information-- all
these drivable paths
that I mentioned before,
and landmarks,
traffic signs, poles.
There's a vocabulary of
about 20,000 different items
that we recognize.
Pavement markings, signs,
poles, reflectors, all sorts.
Fixed things in the
scene that the vehicle
can use for localization.
So the drivable paths are the building blocks of the high-definition map, and the landmarks are the building blocks for localization.
So what you're
going to see here is
that these are the lane
marks, or the drivable paths.
These circles are the landmarks.
And you're seeing
here two projections.
This is a projection
onto Google Earth.
So this projection gives
you a sense of accuracy--
say about 50
centimeter accuracy.
So if the projection of the map onto the scene is such that you don't see the line aligned with the actual lane mark, you know that we have here about 50 centimeter accuracy.
This is a projection
onto the field of view
onto the image space
when the car is driving.
This gives you a sense of centimeter-level accuracy. Because if this lane is not sitting exactly on the lane mark, we're talking about centimeters of inaccuracy.
So if we ran this--
one moment-- OK.
So you can see how
accurate all of this is--
once you'll see the
lane marks, you'll
see how accurate this is.
So we're talking here about an accuracy of a few centimeters, and all this map information is generated automatically.
All cars which have a driving assist module-- which is a front facing camera with our processing chip inside-- are generating this kind of data.
It's about 10 kilobytes
per kilometer.
It's sent to the cloud. In the cloud it is aggregated and the high definition map is built.
And then when the car drives
it goes and identifies
these landmarks and uses the
landmarks to localize itself
within this map at an accuracy
of at most 10 centimeters.
On average it's even
five or four centimeters.
And then that is being used
as redundancy for sensing.
Why is that important?
In order to guarantee
safety we need
to have redundancy
in whatever we do.
So when we go and
detect road users
like vehicles and pedestrians
we have multiple sensors
to get redundancy.
We have cameras, we have
radars, we have laser scanners.
When we're talking about
sensing the roadway--
sensing the drivable paths--
there's only one sensor
that can do that.
And that's the camera
because it's texture based,
it's not shape based.
So a redundancy for
the camera is the map.
Without the map we don't
have redundant information.
The map also provides
us foresight.
Once we know we
are in the map, we
can know where that
path is leading
beyond the range of sensing.
Up to infinity basically.
So this is critical
to build this map.
And this map is being built
through crowd-sourcing.
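To make the localization step concrete, here is a minimal toy sketch-- my own illustration, not Mobileye's production algorithm-- of aligning camera-detected landmarks to the same landmarks stored in the map and recovering the car's planar pose by least squares. The 2D simplification and the function names are assumptions made only for clarity.

```python
# Toy 2D landmark-based localization (illustrative only, not the production method).
# detected_xy: landmark positions measured in the car's frame by the camera.
# map_xy:      the same landmarks as stored in the high definition map.
import numpy as np

def localize(detected_xy: np.ndarray, map_xy: np.ndarray):
    """Solve for rotation R and translation t that best align detections to the
    map (Kabsch / orthogonal Procrustes on matched landmark pairs)."""
    d_mean, m_mean = detected_xy.mean(axis=0), map_xy.mean(axis=0)
    H = (detected_xy - d_mean).T @ (map_xy - m_mean)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                           # avoid a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = m_mean - R @ d_mean
    return R, t                                        # car pose within the map frame
```

With a handful of matched landmarks per frame, the residual of such a fit is also a direct read-out of the few-centimeter localization accuracy mentioned above.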
Let me show you another--
this is in London.
So that was in Las
Vegas, this is in London.
So all these are
landmarks being detected.
And this is a projection
of the map data
onto Google Earth and projection
onto the driving scene.
And you can see this is very,
very-- it's highly accurate.
So this is mapping.
So now when we put
things together
I'm going to show you a clip.
This was a demo that we
built together with Delphi.
Delphi is one of our partners; it's a tier one supplier to the car industry.
So we built together
a vehicle that
drove hands free in a
complicated city environment.
Which was about a six mile
stretch of city and highway.
And it had about 100 drives
a day, day and night,
for four days.
So it's not just one
drive where everything
is carefully planned.
It was 400 drives.
And if something can go
wrong it will go wrong.
Right?
And nothing-- it
was really perfect.
So this is a reporter.
It's a one minute clip
of a reporter reporting
about what he sees.
And it kind of puts
things together.
Let's run this.
REPORTER: I have the latest version of Delphi's autonomous research vehicle. This is an Audi SQ5 that Delphi's fitted with radar, Lidar.
And what's new for
this generation
is a camera system
from partner Mobileye.
That means nine cameras
around the vehicle
that give this car a better
sense of its surroundings.
During the drive on a set route
this car acted very naturally.
It was aggressive
enough but safe enough,
it felt like a human
was behind the wheel.
There's a display in
the car that showed me
what the car was seeing.
I could see when it could
see pedestrians, crosswalks,
traffic lights.
It really had a great
sense of its surroundings.
One thing that
really impressed me
is while we were
in a left-turn lane
another car cut in front
of us and the Delphi car
behaved perfectly.
Another time we also went through a fairly long tunnel; the car lost its GPS connection but still stayed on course.
And one final thing that really impressed me is that this car uses crowd-sourcing to determine its path down the road.
It sees the path that
similarly equipped cars
have taken before it.
And so it follows that
path as well as lane lines.
Now this is still
a research vehicle.
But Delphi says
this system could
be ready for production
around 2019, which
means we could see it in a
production car around 2020
or 2021.
We'll see many more autonomous
car technology demonstrations
at CES.
So stay tuned to Roadshow.
AMNON SHASHUA: OK, so those four clips basically set the stage.
So now let's go
into more detail.
So let's start with-- I'm not going to spend more time on mapping. Although I think it's a fascinating field, I want to leave time for Q&A.
So I think you've got the idea
of the problem of mapping.
I'll focus on two
areas, on sensing
and on the driving policy.
And this is where all
the intelligence lies.
So let's talk about sensing. Where is the AI hidden in sensing? There are three pillars inside sensing.
The first one is
really the obvious one,
and it is the easiest one.
You want to do object detection.
So objects are all the road
users, and traffic signs,
and traffic lights.
So detect vehicles,
detect pedestrians,
detect traffic signs,
traffic lights.
An object is something that you can put a bounding box around. And this is the sweet spot of today's computer vision. Anything that you can put a bounding box around, computer vision is very, very good at.
And in some cases, even
better than human perception
for this particular narrow task.
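As a rough illustration of this "bounding box" sweet spot, a generic pretrained detector already gives image-in, boxes-out behavior. The sketch below uses torchvision and a hypothetical image path-- these are my assumptions for illustration, not the detection stack described in the talk.

```python
# Minimal sketch: image in, bounding boxes out with a generic pretrained detector.
# Assumes a recent torchvision; this is NOT the production detection pipeline.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("road_scene.jpg").convert("RGB")      # hypothetical input frame
with torch.no_grad():
    pred = model([to_tensor(image)])[0]                  # dict: boxes, labels, scores

keep = pred["scores"] > 0.8                              # keep confident detections
boxes = pred["boxes"][keep]                              # Nx4 boxes for vehicles, pedestrians, signs
```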
And this is really an
outgrowth of driving assist.
This is what driving
assist is all about.
Driving assist is about
preventing collisions.
To prevent collisions
you need to detect
vehicles and pedestrians.
You need to do it in a
very high quality way.
Today the false positive--
as Tommy mentioned--
the false positive
of a system that
does automatic braking
on pedestrian detection
is once every 30,000
hours of driving.
So it happens maybe once in
a lifetime of owning a car.
So we're talking
about something that
is very, very high quality.
But there was a long
period of evolution
of more than a decade of
bringing this into perfection.
So this is the easiest
problem to solve.
The second problem is when you
are talking about free areas--
you want to find the free space.
So you have objects
and there is free space
in between these objects.
And the boundaries
of the free space
are the curbs, the
barriers, the guardrails.
So an image is an
input, the output
is a free-form boundary
with semantic information
along this boundary.
So this is the first
place where we're
outside the comfort zone of
classical computer vision.
We need to explain something
a bit more complicated.
This is already in
production in cars.
For example the Tesla Autopilot, the first generation Autopilot, has this kind of technology.
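For intuition about what "a free-form boundary with semantic information" means as an output, here is a toy sketch of turning a per-pixel label map into a per-column free-space boundary. The label values and class names are made up for illustration and are not the system's actual vocabulary.

```python
# Toy free-space boundary from a semantic label map (illustration, not the real system).
import numpy as np

FREE = 0                                   # hypothetical label for drivable road
NAMES = {1: "curb", 2: "barrier", 3: "guardrail", 4: "vehicle", 5: "pedestrian"}

def free_space_boundary(seg: np.ndarray):
    """For each image column, scan upward from the bottom row and return the first
    non-drivable pixel: its row is the boundary, its class is the semantics."""
    h, w = seg.shape
    out = []
    for col in range(w):
        for row in range(h - 1, -1, -1):
            if seg[row, col] != FREE:
                out.append((col, row, NAMES.get(int(seg[row, col]), "unknown")))
                break
        else:
            out.append((col, 0, "open"))   # whole column is drivable
    return out
```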
The third one is really
the most difficult one.
We're finding the
drivable paths.
So here the input is also an
image, the output is a story.
It is not a free-form boundary,
it is not a bounding box,
it is a description of what I see in the scene in terms of the roadway-- in terms of the drivable paths.
Which lane is leading to where?
What is the semantic information
associated with the lane?
It is a story.
It is something where the challenge for a perception system is much, much higher than in the previous two. I call this strong perception. This is where most of the artificial intelligence that is left lies.
And there's no system
today that can do that.
So this is an open problem.
So let's look at how
sensing is utilized.
And there are two approaches.
So let's start.
So the approach on the left I call the map heavy approach, and the approach on the right I call the map light approach.
And what I'm basically saying is that the approach on the left is the wrong approach.
The approach on the right
is the right approach.
So what is the map
heavy approach?
This is the classical way.
Many demonstrations
that you see out there
are using this approach.
So what is this approach about?
You use a 3D sensor, a laser scanner-- a Lidar-- to find the vehicles and pedestrians.
And these are placed in the 3D
coordinate system of the car
because these are 3D sensors.
So a laser scanner will
give you a cloud of points
in a 3D coordinate system.
Then what you do, you localize
the car in the high definition
map.
So somebody built you
a high definition map.
The classical map-makers,
with the traditional methods,
they built you a high definition
map of the area around you.
You localize yourself
in the high definition
map using again the data
from the laser scanner.
You have a cloud of points
from the laser scanner,
the high definition map also
includes a cloud of points.
You do this matching.
And you localize yourself
in this high definition map.
Once you localize yourself
in this high definition
map you take the road users
that you detected and you simply
put them on the map.
So now you have all that
drivable paths from the map.
You have the road users
placed on this map,
and you're done with.
If you recall the first Google vehicles out there-- they had a camera only to detect traffic lights. Otherwise they needed nothing more than a laser scanner.
So it's a very cool thing.
You put your road
users on the map,
you localize
yourself on the map.
You take simply all the
semantic information
about the roadway from the map.
You don't need sensing
at all because somebody
built you the map.
How did that someone build the map?
You don't care.
It's not your business.
They put a lot of
manpower and so forth,
but it's not your business.
And then you inherit--
you have now a
unified coordinate
system with road users
and semantic information
about the roadway.
And you simply control the car.
There's the issue of
how you control the car
with the driving policy.
But let's put that aside.
We're talking about sensing.
So this sounds like
a very cool approach.
And now if you have other sensors, you want to enrich this and make it more robust by adding those other sensors.
So other sensors would
be cameras and radars.
You need to make sure that those
other sensors would be talking
to you in the same
coordinate system-- would
be talking to you in a
3D coordinate system.
Now this is hard-- going from 2D to 3D is hard.
So camera is 2D, and
radars are also 2D.
It's a different kind of 2D.
But it's a two dimensional
piece of information.
So taking other sensors
and putting them
in this coordinate
system is a bit tricky.
What is the map light approach?
The map light approach is
much more difficult to do.
And then I'll go into
the pros and cons.
The map light approach is to
use cameras to detect the road
users-- the vehicles and
pedestrians-- and the roadway
information simultaneously.
Because the camera is the
only sensor that sees them
both in the same
coordinate system,
the 2D coordinate system.
And this is what I had
in the previous slide.
I found all the road
users, the free space,
the drivable paths
using strong perception.
I used a lot of computer
vision and I put them
all in a 2D coordinate system.
We're not yet in 3D.
Now you localize yourself
in this high definition map.
So that was the movie
I showed you before.
You use landmarks, you localize
yourself in a high definition
map that you built using
computer vision beforehand.
Once you do that you can
bring everything to a 3D
coordinate system
because the map
is in a 3D coordinate system.
But now when you put things
in a 3D coordinate system,
if you have errors, the relative information remains the same, because the road users and the roadway live together in the 2D coordinate system.
So if you have an error,
it's going to be uniform.
It's not going to be different
error to the road users
and different error to
the lane information.
So the relative
information remains intact.
And this is critical.
The relative information
remains intact.
And now you have a unified coordinate system-- a 3D coordinate system-- and you can then control the car.
Now if you want to add
additional sensors like Lidars
and radars, you need to
do a 3D to 2D projection.
And that is easy.
Doing a 3D to 2D is easy.
So here's an example of
projecting laser scanner
data into an image.
This is an easy problem.
Projecting radar
data into an image.
This is an easy problem.
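The "easy direction" here is the standard pinhole projection. A minimal sketch follows, where the intrinsics K and the lidar-to-camera extrinsics R, t are placeholder values, not real calibration.

```python
# Minimal sketch of projecting 3D lidar points into the 2D image (pinhole model).
import numpy as np

K = np.array([[1000.0,    0.0, 640.0],    # assumed focal lengths / principal point (pixels)
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                              # assumed lidar-to-camera rotation
t = np.array([0.0, -0.2, 0.1])             # assumed lidar-to-camera translation (meters)

def project_to_image(points_lidar: np.ndarray) -> np.ndarray:
    """Map Nx3 lidar points to Nx2 pixel coordinates, dropping points behind the camera."""
    cam = points_lidar @ R.T + t           # transform into the camera frame
    cam = cam[cam[:, 2] > 0]               # keep points in front of the camera
    uvw = cam @ K.T                        # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]        # perspective divide -> (u, v) pixels
```

The reverse direction needs a depth value for every pixel, which a camera alone does not provide-- that asymmetry is the point being made here.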
So now let's look at
the pros and cons.
This is a clip showing how things look in a top view. This is a 3D coordinate system-- because it's a top view-- built just by looking at camera information and the high definition map.
So let's look at
the pros and cons.
The real advantage
of the map heavy
approach-- you can
do rapid prototyping.
That means if I
want to create a--
take a team of engineers
and within six months
do an impressive demo--
I can do that using
the map heavy approach.
I go and buy a high definition
map, a laser scanner--
detecting road users
using a laser scanner--
a few months of work and I can do something. Especially since what I want to do is a demo.
And then I'm done with.
I don't need to do
much more than that.
I need to control the vehicle,
I can do basic control.
And all of a sudden,
I can be on the news
that I'm a player in
autonomous driving.
OK?
So this is one advantage.
If you look at the disadvantages of this approach here, the high definition map becomes a single point of failure because it all depends on this high definition map.
There's no redundancy.
The drivable paths and the road users are in different coordinate systems-- they don't live together. So it creates errors when you bring them together in 3D.
And then the biggest problem is
creating these high definition
maps-- it is not scalable.
There isn't a way to create
them in a low cost manner
because I'm relying
that somebody else built
me this high definition map.
I'm not answering the question from end to end: how do I make a system, including the creation of the high definition map, that will be very, very low cost?
Because in the
automotive industry,
if something is not low cost
it will not materialize.
And this is something that
sometimes players in this field
do not realize.
So the advantages here are that, first of all, as I said before, the camera is
the only sensor where
you have both the road
users and the
roadway information
in the same coordinate system.
This is very critical.
The creation of
the high definition
map and the localization is
using the same technology,
the same computer
vision technology,
so I can crowd-source it.
This is also very important.
And also I can have systems
which are low cost systems
without a laser scanner.
Now there is a lot of promise that laser scanners a decade from now will cost $200 to $300.
Today they cost many
thousands of dollars.
It is nice, but people
forget that $300
in the automotive industry
is hugely expensive.
A camera module costs about $20. So that's still more than 10 times the cost of a camera.
So if I want to
make a living I need
to make sure that
I have an offering
to give without laser scanners.
And in the map light
approach the laser scanner
is simply another sensor
to robustify my sensing--
to robustify my
interpretation of the world.
It's not a critical element
in the entire thing.
And then the con is that this is very difficult to do. This is not something you do in a six month effort for a demo.
This requires real commitment.
It's years of work
because there's really
strong perception going
on here because I'm really
solving things from end to end.
I'm handling the map,
handling the sensing,
handling the projection
onto 3D, and doing it
in a way that is low cost.
So these are the two approaches.
Let me go into the third pillar.
So sensing and mapping--
we're not going to talk
about this anymore.
Let's go to the third pillar
which is the driving policy.
And this is where most of
the intelligence is located.
So sensing there
is intelligence,
as I said, locating
objects is something
that we know how to do today.
It used to be a challenge 10
years ago-- five years ago--
today it's not much
of a challenge.
The drivable path
is a big challenge.
And this is where some
intelligence is hidden.
The biggest place
where you need AI
is in this area
of driving policy.
It's a negotiation.
And to show you that
this is an open problem,
this is something that was--
I took it from a year ago--
talking about autonomous
test vehicles--
autonomous cars are
really clogging traffic.
They're driving
too conservatively.
We see that this has nothing
to do really with Google,
Google is a great company,
but we saw this also
with Uber vehicles.
They had test fleets
in Pittsburgh.
And reporters would report
back what they sense.
And you clearly
see that whenever
something
semi-complicated happens
the driver behind the steering
wheel needs to take over.
So let me show you two
clips on my way to work.
And on my way to work
this is Jerusalem
and it's very, very
similar to Boston.
So driving in Jerusalem, driving
in Boston, is very similar.
So just to capture the complexity of this, let's run this clip.
So if you look at this vehicle--
so it's squeezing itself in.
This is the first thing.
Now let's look at this one.
This guy is not
going to succeed.
And at some point we're going
to fast forward this clip.
We're fast forwarding and
this guy is not succeeding.
[LAUGHTER]
AMNON SHASHUA: So
negotiation can fail.
It doesn't mean that we're
going to have an accident,
it means that we can
fail in our desires.
Let's look at this one here.
This is a real challenge
because it's long.
Imagine the lengths of
planning that's going on here.
Because at some point we're
going to fast forward this.
Still this guy-- poor guy--
is working his way in.
OK?
This guy also is
going to squeeze in,
it's going to be difficult.
[LAUGHTER]
AMNON SHASHUA: So do you
think that any autonomous
car out there can do
something like this?
No way.
OK?
Let's look at another example.
Let's look now at
the concrete example.
The concrete example is
called a double lane merge.
So in a double lane merge you have, as you see here, vehicles from this path coming in, and vehicles from this path.
Vehicles can cross or can
stay in their own path.
And what makes it challenging
is that there are no rules.
The only rule is
don't make accidents.
So there could be deadlocks here because you may be interfering with the plans of the other driver.
It's not just squeezing in.
So it's a complicated
negotiation.
And so people normally say
that the four way stop is
the most challenging maneuver.
But no, it's not a
challenging maneuver
because there are rules.
There is the right of way rule.
So you need to respect
the rule and squeeze in.
So it's not that challenging.
Here because there
are no rules it's
very, very, very challenging.
And the classical
way of doing this
is opening up a tree
of possibilities,
and traversing this tree.
And this tree
grows exponentially
with the planning time.
And you need to plan long as
you saw in the previous clip.
And the number of agents
around you, and so it's
really a very time consuming
computationally intensive
problem.
And people find all sorts of [INAUDIBLE] for traversing this tree.
And it's a big issue.
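To make the combinatorics concrete, a back-of-the-envelope sketch with purely illustrative numbers (not figures from the talk): with a actions per agent per step, n agents, and a planning horizon of T steps, the naive tree has on the order of a^(n*T) leaves.

```python
# Back-of-the-envelope size of the naive search tree (illustrative numbers only):
# a actions per agent per step, n agents, horizon of T steps -> a**(n*T) leaves.
a, n, T = 5, 6, 8
print(f"{a ** (n * T):.1e}")   # ~3.6e33 leaves -- hopeless to enumerate exhaustively
```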
So let's look-- show
an example here.
And I'll show you
in this example--
So this is a double lane--
a double lane merge.
We have a path here, path here,
and cars can stay or change
lanes.
So this is just to give you
a sense of this double lane
merge.
And in a moment we'll see a challenging one. So this guy, rather than starting here, started here, and then he's going to cross, upsetting everyone else while doing it.
And the next one is not rare.
This thing here,
let's focus on this.
Let's see what's
going to happen here.
And this is the deadlock. What we don't hear is them cursing at each other or something.
[LAUGHTER]
AMNON SHASHUA: So it's
a challenging problem.
And we would like to
use machine learning.
Now normally everything
is machine learning today.
So what's the big
message of we would
like to use machine learning?
Well there is a
big message here.
And there is a reason why
machine learning is hardly
employed in this problem.
So to understand this, remember that machine learning is a data driven approach. We say it's easy to collect data, rather than understanding the underlying causes behind the problem that we want to solve.
So it works perfectly fine
for pattern recognition,
for natural language
processing from voice to text.
There are many areas in which
if we collect a lot of data
and feed it into a black box--
a machine learning black box--
we solve the problem.
But what is the downside?
The downside is that
machine learning is based
on the statistics of the data.
So there could be corner
events-- rare events.
And in order to cover
these rare events
we need lots and lots of data.
Because we're doing stochastic
methods, stochastic gradient
descent, we'll need to
run through the data
many times until we flush out all the rare events-- the corner events.
So this is kind of the downside.
So when you talk
about sensing, sensing
is a classical
area where you want
to use machine learning, because you are sensing the present-- not planning for the future, just sensing the present.
The technology
that you are using
is deep supervised learning.
And that's fine.
When you are doing
driving policy
you're planning for the future.
And that technology is
reinforcement learning.
Now there is a big
difference between these two.
And the difference is the
way we use the training set.
So let's look at
the case of sensing.
So when you're
doing sensing, when
you're using
supervised learning,
our actions are predictions.
Our predictions do not
affect the environment
whether we make a
correct prediction
or a wrong prediction, we're
not affecting the environment.
This means that we can collect
all our training data offline.
We collect it once and we simply run through the training data
again, and again, and again
to flush all the corner cases.
In reinforcement learning,
or in driving policy,
our action affects
the environment
because we decide
to change lanes,
we decide to accelerate,
we decide to slow down--
we are affecting
the environment.
So if I change my
driving policy I
need to collect the data again.
I cannot do a collection
of data offline.
Every time I change
my driving policy
I'll need to collect
all the data again.
And because the rare
events in driving policy
are the accidents--
these are the rare events--
I need to collect a
lot of data in order
to cover these rare events.
Every time I change
my driving policy
I'll need to collect this data
again, and again, and again.
Therefore it's not an attractive proposition.
And people do not use machine
learning because of that.
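A toy contrast of the two data regimes just described, with made-up numbers: the supervised loop reuses one fixed offline dataset for many passes, while the policy loop must re-collect experience after every update, because the states visited depend on the policy itself.

```python
# Toy contrast (illustrative only): offline supervised data vs. on-policy experience.
import random

# Supervised sensing: one dataset collected offline, reused for many passes.
offline_dataset = [(random.random(), random.random()) for _ in range(1000)]
for epoch in range(10):
    for example in offline_dataset:        # same fixed distribution every epoch
        pass                               # (gradient step would go here)

# Driving policy: the data distribution follows the policy, so old rollouts go stale.
policy = 0.0
for iteration in range(10):
    rollouts = [policy + random.gauss(0, 1) for _ in range(100)]   # fresh, on-policy data
    policy += 0.01 * (sum(rollouts) / len(rollouts))               # update -> must re-collect
```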
So the challenge is
how to guarantee safety
in machine learning technology.
And we solved this.
We have even a paper about this.
The paper is about separating safety-- where you have a mathematical model of safety that guarantees that there are no accidents-- and leaving the machine learning only for the desires.
Therefore the rare
events are considerably
reduced because normally the
rare events are the accidents.
And you have a model that
guarantees that you're not
going to have
accidents, therefore,
you can focus only
on the desires.
And desires are not necessarily
met as we saw in the examples
that I gave.
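A minimal sketch of the separation being described: a hand-written safety rule that can veto actions, and a learned component that only ranks the remaining "desires". The rule, thresholds, and scoring function here are invented for illustration and are not the model from the paper.

```python
# Illustrative separation of a hard safety check from learned "desires".
import random

def is_safe(action: str, gap_m: float) -> bool:
    """Hypothetical formal safety rule: anything but yielding needs a minimum gap."""
    return action == "yield" or gap_m > 12.0

def desire_score(action: str) -> float:
    """Stand-in for the learned policy's preference over maneuvers."""
    return random.random() + (0.3 if action == "merge" else 0.0)

def choose_action(gap_m: float) -> str:
    candidates = ["merge", "accelerate", "yield"]
    safe = [a for a in candidates if is_safe(a, gap_m)]   # safety filter first
    return max(safe, key=desire_score)                    # learning optimizes desires only

print(choose_action(gap_m=8.0))   # small gap: only "yield" survives the filter
```

Because the filter, not the learner, is responsible for never causing an accident, the learner no longer has to see the rare crash events in its training data.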
So let me show you a simulation.
So in this simulation we
have eight cars randomly set.
The red car should go right,
the white car should go left.
And I'm going to show
three such initializations.
And you can see this is
quite complicated maneuvering
going on.
We can measure the
following statistics.
We can measure what is the
probability of an accident.
Well it is built
in a way that it
should be zero probability
of an accident.
So if we have an accident
it's a bug in the system
because the model will guarantee
it will not have an accident.
Then we can measure what is
the percentage of success,
because we know that we may
not succeed in our maneuver.
There could be cars
that will not find
their way in the right path.
We would like a very high
percentage of success.
Third, we can ask ourselves,
what is the computing
time of all of this.
Because normally
we're led to think
that the computing time
of something like this
grows exponentially.
Therefore, the computing time
is going to be significant.
So when you run this say 100,000
runs we have 0% accidents.
That means we don't have bugs.
And out of the 100,000
runs only 200 failed.
And when you look
at these 200 you
see that even a human
would find it very,
very difficult to accomplish
because the cars are
placed randomly.
And sometimes the
cars are placed
towards the end of the
stretch, there's not
enough time to do the maneuver.
And in terms of the computing
time it takes about 1%
of the computing
time of sensing.
So it's something that
can be on the same chip
on the same processing platform.
This is another example.
This is one slide
before the last.
So this is kind of
pushing your way through.
So all agents here--
all agents here
have been trained
with this driving policy.
And you can see these are
kind of complicated maneuvers.
OK so if I summarize, again
there are these three pillars--
sensing, mapping,
and driving policy.
So to do sensing right you need
to solve an AI problem which
we call strong perception.
This is about understanding: image in, story out-- not image in, objects out, and not image in, curves out. Image in, story out. This is kind of strong perception.
Mapping done right-- you need to use this strong perception in order to build the building blocks of the high definition map, and you use perception
in order to localize
yourself in this map.
And then driving
policy done right--
you need human
level negotiation,
but on the other hand, you
need to guarantee safety.
OK so these are kind of going from easy to difficult.
So when we say that
2021 is the year where
self-driving cars from a
technological perspective
will be on the road it is
because not all the pieces
are there yet.
We still need to work on pieces.
I believe that we are not waiting for a scientific revolution; it's only a technological revolution that we need to have.
And therefore a few years
is something that is OK.
If you are waiting for
a scientific revolution,
it could take 50 years.
Who knows when a scientific
revolution will hit us.
But the building
blocks are there,
it's just putting them together.
So this is the end of
the kind of formal talk
and we can have the
floor for questions.
We can start with
the flyer maybe.
So about privacy-- the issue of privacy, in what I told you, is about this crowd-sourcing.
How do we create the maps?
We use crowd-sourcing.
And privacy is a big issue.
And the way it is
going to be done
is all the data is anonymized.
Because it's not that we need to know where Joe and Moe have driven; we need the aggregated data in order to understand the patterns of driving, in order to build the maps.
So this data will be anonymized.
And if I compare this to the Facebook and Google that I carry in my pocket, we're talking about something which is much, much milder than what we have today in terms of privacy.
OK so we can start with Q&A.
[APPLAUSE]
PRESENTER: If you
do have questions,
please find one of
the microphones.
TOMASO POGGIO: Yeah, if you can--
go ahead.
AUDIENCE: Hi, there.
Thank you for the talk.
That was wonderful.
I was wondering what you thought
the most pressing challenges
in reinforcement learning
are right now relating
to the self-driving problem?
AMNON SHASHUA: So the
biggest challenges
of this driving policy--
we use reinforcement learning.
Unlike what you may think
reinforcement learning
is a much more
challenging problem
to get good solutions
than supervised learning.
With supervised learning,
when we build a deep network
there are all sorts of
magical things going on there.
We can build a network that has many more parameters than the number of examples that we are using.
They normally converge
to a local minimum, which
is a very good local minimum.
And then of course,
there is lots of research
to try to understand
why this is so.
But from a practical
point of view
these networks work
very, very well.
This is not the case for reinforcement learning. If you try to recreate papers
using reinforcement learning
to solve all sorts of
interesting problems
you'll be greatly disappointed.
So there's lots of tuning
to understand what works
and what does not work with
reinforcement learning.
And reinforcement learning
is really the bedrock
of solving the driving policy.
AUDIENCE: Can I ask
about in terms of 2021,
I think a lot about unprotected
left turns in the U.S.,
or say unprotected right
turns in the UK or Japan.
Do you think that's
achievable in 2021?
What are some of the
challenges for situations
that involve no traffic signals,
vehicles coming at high speed,
and sort of the negotiations
with other drivers.
Do you think we can do that?
AMNON SHASHUA: I
think we can do that.
It's all about
guaranteeing safety.
But I need to qualify this, that
means for this industry to work
we cannot assume zero accidents.
There's no such thing
as zero accidents.
Zero accidents mean
that we don't drive--
simply stay there, stay
put and don't drive.
So I would compare it to
the industry of airbags.
So we all know that
airbags save lives.
What you may not know is that
airbags also kill people.
When they deploy at the wrong
time, at the wrong speed,
at the low speed, you hit the
curb and all of a sudden the
airbag deploys and
breaks your neck.
It happens.
And it happens every year, you
simply don't know about it.
And society has learned to live
with it because on one hand--
because the chance of
something going wrong
is infinitesimally small,
and on the other hand
it saves millions of lives.
And society knows
how to live with it.
The same thing could happen
with autonomous driving.
If one can show that the
chance of something going wrong
is infinitesimally small.
So let's try to
think about this.
What does that mean?
Take for example the U.S. They
have about 35,000 fatalities
every year due to accidents.
If we can reduce this by
three orders of magnitude,
let's say 45 fatalities a year.
And these 45 fatalities
would be because of something
went wrong in an
unprotected left
turn or something like that.
AUDIENCE: Are those
the kinds of numbers
you're imagining in 2021,
or that's hypothetical?
AMNON SHASHUA: This
is hypothetical,
I don't know now what
we're going to reach.
[INTERPOSING VOICES]
AMNON SHASHUA: In order
for society to accept it
one will need to prove that
you can get to that point.
So in 2021 you're
not going to see
autonomous cars driving without
a driver behind the steering
wheel.
It will take years of
collecting data, making sure
that these vehicles
are safe, and answering
this question-- what is the
probability of an accident.
And if one can
answer this question
that probability of
an accident went down
by three orders of
magnitude, maybe 2 and 1/2--
maybe 100 fatalities
a year could
be acceptable to society--
then this could be acceptable.
So it is not that the goal
is to reach 0% accidents.
The goal is to on one hand
flow in traffic like humans yet
have a model that
guarantees safety.
At least safety in
the sense that it's
not through your actions that
an accident has been created.
I mean if I drive and somebody
hits me from the side,
there's nothing I
can do about it.
So this is an answer
to that question.
AUDIENCE: Thank you.
AUDIENCE: Hi, my
name is Sean Jane,
I'm a student here
studying computer vision.
Thanks for the talk.
I was wondering how does
Mobileye protect its IP given
that this is such a hot field,
and employees are poached.
Or employees leave and
form their own companies.
AMNON SHASHUA: OK, that's
a very good question.
So we do all the things
that the companies do.
We have patents, we
have trade secrets.
But we have something that you
don't have here in the U.S.
We work in Israel.
Israel is a bit
different, people
are much more loyal
to their organization.
[LAUGHTER]
AMNON SHASHUA: We've
never had an employee
leave and move to
another company
that's competing with us.
And we are already
18 years in business.
We have never had a
knowledge leakage, which you
find a lot in Silicon Valley.
People move from
company to company,
from company to competitive
companies and so forth.
So I think being in Israel
was really a blessing.
Now that we will be parting from it, it will be my challenge how to preserve this.
This is how we
protect ourselves.
[LAUGHTER]
AUDIENCE: Hi, I think
in your examples
you mostly were
describing situations
where the cars don't
talk to each other,
especially in driving policy.
How much of a
difference would it
make-- either some
lightweight discussion,
communication between the cars?
And what percentage--
does it have to be 100%?
Or do you get
significant benefit
if some percentage
of the cars can
communicate with each other?
AMNON SHASHUA: OK
so you're talking
about what is known in
the industry as V to V--
vehicle to vehicle
communication.
And vehicle to vehicle
communications is a good idea.
What it can give you-- it can give you the ability to detect an object which you don't have a line of sight to. Because for an object that you have a line of sight to, you have sensing information.
Now is this a
necessary condition
for autonomous driving to
detect vehicles that you
don't have a line of sight to?
The answer is no.
Because if the answer
was yes we would all
stop doing any activity in
autonomous driving because V
to V communication-- ubiquitous
V to V communication--
that all cars have the ability
to communicate and send
their precise
location is not going
to happen for the next
two or three decades.
Once it starts it'll
take a long, long time
until all cars would have
this V to V communication.
So I think V to
V is a good idea,
but it is kind of orthogonal to
the activity of the autonomous
driving.
Humans can drive without having
this superhuman capability
of detecting other road users
without having a line of sight.
And if we are not driving drunk, and if we are responsible, we have a very, very small percentage of accidents.
And we would like
robotic cars to achieve
the same level of performance.
And it is possible because
we have a proof of concept,
and that is humans.
AUDIENCE: Hi, thank you, Amnon.
So I have a lot of--
thank you very
much for the talk.
It's very helpful for this
audience and also for me.
And I have a question
about the perception.
So we know that, as you said, the perception right now is focusing on object detection that provides a bounding box for each type of object.
And we know that there will be two types of errors-- false positives and false negatives-- right?
Right now your product is mostly focusing on the safety feature, where it is OK to have some false negatives-- like missed detections-- because people may not notice that.
But if it comes to the self-driving car, we have to have a system with absolutely no missed detections, because any missed detection could be a fatal crash.
And we know that Tesla had two deadly crashes, one with a big truck-- a white truck. And now there is one in China-- not many people know that-- crashing into a special trash truck.
And so to my understanding
it's really hard
to prevent accidents
with these kinds of--
types of objects.
For example the cars
with special paint
or the people carrying a tree--
or a Christmas tree,
something like that.
So I think we may need a more general object detection rather than this type of supervised training on specific types of objects.
I would like to ask
your idea about that.
AMNON SHASHUA: OK, so
it's a great question.
Every time Tesla is mentioned, I get upset.
[LAUGHTER]
AMNON SHASHUA: And
I will explain why.
AUDIENCE: Sorry.
AMNON SHASHUA: But
it's a great question.
So driving assist
today is technology
to prevent accidents.
Now the driver is responsible.
The driver is holding
the steering wheel.
The driver is driving,
the driver is responsible.
So what you really want to
optimize in a driving assist
is to have zero false positives.
You are willing to have
a certain small level
of false negatives.
But you want to have
zero false positive,
because imagine you are
a layman driving a car
and all of a sudden the
car has an emergency
braking for nothing.
There was a shadow on the
road and your car stopped.
You'll simply take that car
and return it to the dealer.
You're not going to
drive this car again.
So you really need to reach a
zero level of false positives.
Then maybe it will happen once
in the lifetime of the car,
or something like that.
When you're talking
about autonomous driving
you have to have zero
false positives and zero
false negatives.
The way you reach zero
false negatives is you
have multiple sensors-- you
have multiple modalities.
Whereas in driving assist
you mostly have one sensor,
the predominant
sensor is the camera.
In some cases, in
premium cars you
have a camera and the radar.
In autonomous driving we're talking about every area of the field of view having at least two sensors and modalities.
Now the crash-- the Tesla crash-- had nothing to do with false negatives or false positives.
There was a NHTSA crash report.
And the NHTSA crash report
said what we said earlier,
is that the crash
happened outside
of the design of the system.
The system was designed
for rear end crashes.
This is driving assist.
The accident was t-bone.
Now the sensors of the car were not designed-- especially not the camera-- for t-bone detection. They were designed for rear end crashes.
Now that does not mean that
we cannot do t-bone detection.
But in the system of Tesla
there was no t-bone detection.
Tesla came out with stories about a white truck and the sun and so forth, and so forth.
This made us very upset.
Then there was a
NHTSA crash report
which said this was
outside of design
parameters of the system.
We're not talking about
limitations of sensing.
Sensing, the way we do it, uses deep learning, uses data driven techniques, uses multiple sensors.
You can reach 0% false positives
and 0% false negatives.
And this is the easy problem
among all the problems
that I mentioned.
People tend to focus on that
because you can tell stories.
I'm holding an umbrella,
I'm waving my hand.
All of this is easy problems.
None of them are
difficult problems.
The difficult problems
are the other problems
that I mentioned, which
people don't talk about.
AUDIENCE: Yeah, sure.
Now can I ask another question
about the hard problem?
So you talk about
the driving policy.
And you shared that very cool demo where the car can negotiate in that double merging case. But I wonder-- I have a question about how it can transfer to the real scenario.
Because the challenge for reinforcement learning is that what you train is how to react to yourself-- like the other cars have the same policy.
But a human may have a different kind of model, a different kind of mental state. So they may have a different policy, different decisions.
AMNON SHASHUA: It is a great question because I didn't touch on it.
What I talked about is the
robotic car driving policy--
having a model which
will guarantee safety.
And how would one use
machine learning such
that we can guarantee
safety on one hand
and drive like humans
on the other side.
But now comes a new question,
how do I validate this?
So let's assume that I'm a regulator and there is an operator that wants to deploy self-driving cars.
And I want to measure the
probability of an accident.
How do I do that?
I cannot do it on a test track. What, am I going to drive on a test track and say everything is OK?
Test track doesn't reflect the
complexity of the real world.
Am I going to drive around
my block one million miles
and say, I drove one million
miles and everything is OK?
This is what people
do by the way.
How do you go and validate this?
So one way to validate is more
or less what you are saying.
Let's try to build a generic
model of how people of humans
drive.
We're talking about
human driving policy
not robotic driving policy.
Let's do something similar to what is done in the area of pictures. We have these GANs-- the generative adversarial nets that create realistic pictures.
Why not try to create realistic trajectories by collecting a lot of data-- let's create realistic trajectories of how a human drives.
Using the high definition
maps create a computer game
where we have agents
driving on realistic roads.
And the trajectories
of their driving paths
are mimicking human
drivers, including
reckless human
drivers, and so forth.
And then we take our vehicle
with our robotic driving policy
and now we drive
in the simulator
and we drive millions of times--
infinite number of times.
And we need to prove that
we don't have accidents.
And this is still
an open problem.
And this is this would
be the way, I believe,
in which these technologies
would be validated.
Otherwise how do
you validate this?
Wait years until
you show that you're
testing a fleet of
1,000 vehicles and all
these measures of how much
time I hold the steering wheel.
It's all very, very misleading.
Because I can drive in simple
areas for one million miles
and show that I don't
touch the steering wheel.
And I will avoid going into complicated areas because I don't want to mess up my statistics.
Right?
AUDIENCE: That's right.
AMNON SHASHUA: So this
is an open problem.
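One way to see the scale of the validation problem being pointed at is the standard "rule of three" for rare events-- a back-of-the-envelope sketch with illustrative target numbers, not figures from the talk.

```python
# How many clean simulated drives are needed to bound a rare accident probability?
# With zero accidents in N independent runs, ~3/N is a 95% upper bound (rule of three).
import math

def runs_needed(target_prob: float, confidence: float = 0.95) -> int:
    """Smallest N with (1 - target_prob)**N <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - target_prob))

print(runs_needed(1e-7))   # ~3.0e7 clean runs to argue the rate is below 1e-7 per drive
```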
AUDIENCE: But I think--
AMNON SHASHUA: But I think
others want to ask questions.
[LAUGHTER]
AUDIENCE: Sorry. Thank you very much.
AMNON SHASHUA: Yep, this side.
AUDIENCE: Hi.
Thanks for the talk.
I wanted to know what differentiates the big players in the entire autonomous driving industry. And, for example, how does Mobileye differ in its engineering and technology from other companies?
Is there generally a
metric for measuring
like which companies are doing
better than other companies
right now?
AMNON SHASHUA: Well, autonomous driving is not out there yet. So all the engineering efforts we know about are basically science projects. So it's very, very difficult to tell what the performance of others is.
All that you are exposed to are some test vehicles that reporters are driving.
So you don't really
know what is out there.
But you do know that there
are different approaches.
And those are the
two approaches that I
mentioned-- the map heavy
and the map light approach.
I believe that we are the
only ones with the map light
approach.
And this approach is very
attractive to the car industry
because it really
fits in the way they
see the world in
terms of the cost,
in terms of the scalability.
It is the way that they
can leverage their size.
If you have millions of
cars, these millions of cars
can generate the maps.
Using crowd-sourcing you
reduce the cost of the map.
Everything fits in the way the
car industry looks at things.
In terms of performance there
is no production worthy vehicle
out there.
It's all testing vehicles.
So it's difficult to say who is
the king of the hill right now.
There is no king of the hill.
It's all science projects.
And it will take
a number of years
until we'll start seeing
these vehicles in production
at some point.
I'm opening a can of worms here in terms of the answer.
The way this is going to unfold is that it's not that we're waiting now until everything is perfect in 2021, when we would have mobility on demand with vehicles that can drive without a driver.
There are going to be steps along the way in which you are going to have something like an autopilot-- but better, and better, and better.
The driver still needs to be alert, but you can get performance which is going to be very, very similar to the performance of true autonomous driving.
But you have to be alert, in order for suppliers to be able to fine-tune the technology.
Because if you wait until
the technology is perfect,
it's not going to happen.
You need to test.
And you need to test not only
with a fleet of 10 vehicles
driving around the
block, you need
to be able to test using tens
of thousands of vehicles.
And that you can do when you
have vehicles in production.
So there are a number of programs, starting in 2019, in which you would get a kind of autopilot, better than what it is today, from many car manufacturers.
You are going to see in 2021 also vehicles that are limited only to highway driving, but safe highway driving.
This is called level three.
So that one can have tens of thousands of these vehicles also generating data for the kind of driving policy that I mentioned before-- data for learning human driving policy.
It's not that in 2021, all of a sudden, we're going to have mobility on demand.
It's going to be
a phased approach.
AUDIENCE: Thank you.
AUDIENCE: Thank
you for your talk.
I think we could agree
that your camera is
a safety critical piece of
equipment in your system,
correct?
And then you used the
analogy of the map
as being the redundancy
for the camera.
Do you class your map data
as a safety critical piece
of equipment?
And treat it as such?
Will it have that
kind of requirement?
AMNON SHASHUA: Well, the map data is safety critical.
But it's not a single
point of failure
because you have
also the sensing.
As I said, the strong
perception-- the sensing
should be sufficiently advanced
and sophisticated to understand
the drivable paths
even without the map.
And then the map becomes a
redundancy to the sensing.
But the map should be
accurate all the time.
The way the map is
accurate all the time
is that you build it
through crowd-sourcing.
Therefore, you have millions
of cars always generating data.
So once something changes
in the environment
it's almost immediately
being changed in the map
and then being
transmitted to the cars.
So in terms of the communication
you have uplink and downlink.
The uplink is a simpler problem, because we're talking about 10 kilobytes per kilometer.
So if you're driving 100
kilometers it's one megabyte.
My smartphone sends
much more than that.
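As a quick back-of-the-envelope check of those uplink numbers (the 10-kilobytes-per-kilometer figure is the one from the talk; the rest is just arithmetic):

    uplink_kb_per_km = 10      # uplink payload quoted in the talk
    trip_km = 100
    print(uplink_kb_per_km * trip_km / 1000, "MB")   # -> 1.0 MB for a 100 km drive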
Then there's the downlink.
On the downlink you are going to send, say, map data for 100 square kilometers.
So that's not going to be one megabyte.
We're talking about
5G communications
that will be coming out
in the next few years.
So when we're talking
about autonomous
driving on the downlink
you need to think also
in terms of 5G networks.
And then you can have
continuous updates of these maps
wherever you are driving.
And because you have millions
of cars doing crowd-sourcing,
the map should be
always, always correct.
AUDIENCE: Right.
I'm just trying to tease out the distinction between a map which has to be very good versus a map that has to be life dependent.
AMNON SHASHUA: The map
should be very good.
AUDIENCE: But not life dependent?
AMNON SHASHUA: It has to
be very good and live.
It has to be updated
all the time.
AUDIENCE: OK.
AUDIENCE: As a
Mobileye shareholder
and a Tesla shareholder,
I never understood
what really happened
at the divorce.
Is there anything you could
share with this intimate group?
[INTERPOSING VOICES]
[LAUGHTER]
AMNON SHASHUA: No, I
cannot share anything.
We had an ugly divorce.
And at some point we said we're
not going to comment anymore.
And then Tesla also stopped commenting.
And now we're all happy.
So I'm not going to comment
any more about Tesla.
[LAUGHTER]
AUDIENCE: So I have a question
about the front facing camera.
So if all the sensing technology is very dependent on this camera, what will be the desired specifications for this camera?
For example, the frame rate,
exposure, that kind of stuff.
Because I can imagine when you're driving, the scenery on the side will be passing by very fast, right, so if you use a low frame rate you won't be able to capture as much.
AMNON SHASHUA: So
automotive cameras
are slightly different
from consumer cameras.
For automotive cameras you need very, very good low-light performance.
The pixel sizes are much
larger than the pixel size
of consumer cameras.
So for example, my iPhone--
I think the pixel size in an iPhone is 1.6 microns.
The pixel size in automotive cameras is 4.5 microns.
So they gather much more light.
So this is why the automotive
cameras have low resolution.
The most advanced cameras
out there in production
have 1.3 megapixels.
We're talking about an order of magnitude less than consumer cameras.
But there are also
tricks that will come out
in 2019 of how to get
high resolution cameras.
And this is using binning.
Analog binning means that you're looking at super-pixels, say two by two pixels or three by three pixels, and treating them as a single pixel in terms of light collection.
And with analog binning you can switch this at frame rate.
So you are basically trading resolution for light sensitivity.
When you have enough light, you can have the full resolution.
When you need more light, you reduce your resolution by having these super-pixels.
And you can do it at frame rate.
So cameras starting in 2019 will have an eight-megapixel resolution, which is starting to be interesting from the point of view of consumer cameras, with this analog binning capability.
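A rough sketch of the arithmetic behind those figures. The pixel pitches and the 8-megapixel sensor are the numbers from the talk; the area ratio and the ideal 2x2 / 3x3 binning factors are simple geometry, ignoring real sensor effects:

    phone_px_um, auto_px_um = 1.6, 4.5                   # pixel pitch in microns
    print(round((auto_px_um / phone_px_um) ** 2, 1))     # ~7.9x light per automotive pixel

    full_res_mp = 8.0                                    # 2019-era automotive sensor
    for n in (2, 3):                                     # 2x2 and 3x3 analog binning
        print(f"{n}x{n}: {n*n}x light, {full_res_mp/(n*n):.1f} MP effective")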
But we're not talking about
only front facing cameras.
It's cameras 360 degrees.
It's about seven to eight
cameras around the car.
AUDIENCE: I see.
Sorry, I just have a
very quick follow up.
If you have a fixed frame rate-- for example, at night there might be a situation where the car coming in front of you has a very bright light.
With a fixed frame rate that would actually saturate the camera.
So do you have some
sort of feedback--
AMNON SHASHUA: Yes, so there's a lot of sophistication going on in camera processing-- not the computer vision, the camera processing-- in how to use multiple exposures.
We can change the gain of the camera and the gain curve at frame rate.
So we crank the camera at
about 60 frames per second
even though we want to work
at 30 frames per second--
in some cases 10
frames per second.
So the frame rate
depends on the task.
Not everything is done
at 30 frames per second.
Sometimes 10 frames per second.
But the camera runs at
60 frames per second.
And we're using multiple exposures in order to get this high dynamic range.
There's a lot of sophistication in the camera control.
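A minimal sketch of the kind of exposure scheme being described: the sensor runs at 60 frames per second, alternating a short and a long exposure, and each pair is merged into one 30-fps high-dynamic-range frame. The toy sensor model and the per-pixel merge rule below are invented for illustration:

    def capture(exposure_ms, scene):
        # Toy sensor: pixel value is luminance times exposure, clipped at 255.
        return [min(255, int(lum * exposure_ms)) for lum in scene]

    def merge_hdr(short, long, ratio):
        # Prefer the long exposure; where it saturated, fall back to the
        # short exposure scaled onto the same brightness scale.
        return [l if l < 255 else s * ratio for s, l in zip(short, long)]

    scene = [0.5, 5.0, 50.0]     # dark roadside, tail light, oncoming headlight
    short = capture(2, scene)    # one 60-fps frame, short exposure
    long = capture(16, scene)    # the next 60-fps frame, long exposure
    print(merge_hdr(short, long, ratio=16 / 2))   # one 30-fps HDR frame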
AUDIENCE: Thank you.
AMNON SHASHUA: OK.
AUDIENCE: Hi.
I have one kind of
medium length question.
So you mentioned that we could have self-driving cars by 2021 if we have the technological revolution.
Do you think there could be another technological revolution after that to maybe enable superhuman driving, where self-driving agents don't really drive like humans but drive better than humans?
Maybe by controlling more surfaces on a car, or making use of the physics behind it, so that they can just drive at 200 miles per hour on a highway.
AMNON SHASHUA: It
is a great question,
because when I talked about
maps nobody asked me here, why
do we need maps to begin with?
Right, because we humans--
we don't need a map.
We need a navigation map
to know where to drive,
but we don't need a map
in order to drive, right?
And if we want to mimic
humans then let's go for it.
Why do we need a map?
The reason for the map is that we know that human intelligence, in terms of sensing and all-around driving, is so high that we would really need a scientific revolution to reach that level.
Forget about all the hype
that people talk about AI.
We'll really need a
scientific revolution
to reach anything close
to human perception.
The map is a way
to lower the bar.
This is why we need a map.
We're giving the
system something
that humans do not have.
AUDIENCE: To make
life easier basically?
AMNON SHASHUA: It will
make life possible.
[LAUGHTER]
AMNON SHASHUA: Without it--
we're here at MIT,
we're talking truth.
Forget about all this hype.
We're very, very far
away from anything
that is even close to
human capabilities.
Very far away.
The map is a way
to bridge the gap.
And it's a huge gap.
This is why the map
is so, so critical.
AUDIENCE: OK.
Thank you.
AMNON SHASHUA: And let
me know when to stop.
OK?
[INAUDIBLE]
AUDIENCE: I'm the last?
OK, who cares.
[LAUGHTER]
AUDIENCE: So how do you actually-- you have some desires and you have some constraints you need to satisfy, and you use RL so that it will come up with a policy that satisfies these constraints.
How do you actually do it?
AMNON SHASHUA: Well, read the paper that we wrote.
But in a few words, it is very close to AlphaGo-- the Go-playing reinforcement learning by DeepMind.
There also you have a tree of possibilities that you need to traverse.
You learn how to traverse this tree using imitation learning.
And then on top of that you are using reinforcement learning to find the most likely path along a longer tree.
Something very similar is happening in what we are doing, with the addition that we have a mathematical model of safety to guarantee that we'll not have accidents.
This is something that was not done before.
And in that way we remove all those rare events for which we would otherwise need to collect a lot of data.
Because the model guarantees that there is not going to be an accident, we can focus only on the desires.
AUDIENCE: So that's
orthogonal to the learning?
That's kind of pruning the
tree afterward as a constraint?
AMNON SHASHUA: It makes
the learning possible,
because otherwise we'll need
to collect a lot of data
in order to find those rare
events which are the accidents.
And we will have to
collect this data again,
and again, and again.
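A very rough sketch of the separation he is describing: the learned part (imitation plus reinforcement learning) only ranks maneuvers by how desirable they are, while a hand-written safety model vetoes anything that could lead to an accident, so the learner never needs rare crash data. The scoring function and the 2-second-gap safety rule below are made up for illustration and are not the model from the paper:

    def desire_score(m):
        # Stand-in for the learned policy: prefer faster, smoother progress.
        return m["speed"] - 2.0 * m["jerk"]

    def provably_safe(m):
        # Stand-in for the formal safety model: never plan a following gap
        # below a fixed safe threshold.
        return m["gap_s"] >= 2.0

    candidates = [
        {"name": "cut in now",       "speed": 30, "jerk": 2.0, "gap_s": 0.8},
        {"name": "wait, then merge", "speed": 25, "jerk": 0.5, "gap_s": 2.5},
    ]
    safe = [m for m in candidates if provably_safe(m)]   # hard constraint first
    print(max(safe, key=desire_score)["name"])           # then optimize the desires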
AUDIENCE: Cool, thanks.
AUDIENCE: Hi.
So you're talking
about millions of users
uploading the map
information and you
combine that information
in your cloud--
on your server side to create a very precise map.
But what about those
places that few people go?
AMNON SHASHUA: By 2020 every
new car in Europe and the U.S.--
and I believe also in Japan
and in China-- every new car
will have a front facing camera.
You can see also that the number of chips that Mobileye is selling has been almost doubling every year since 2012.
So in 2016 we sold
about 6 million chips,
or 6 million cars.
All right.
By 2020 every new car is coming
out with a front facing camera.
This front facing camera
will have the ability
to send the data to the cloud.
It is reasonable to say that
tens of millions of cars
will be sending data.
So there's not going to be a
place where no car is passing.
And if no car is
passing then you
don't need to have
autonomous driving there
because there is a reason
why no car is passing there.
[LAUGHTER]
AUDIENCE: But how
many users do you
need for a specific
position or place?
AMNON SHASHUA: It's
a good question.
Right now we're doing
it with five drives.
So five vehicles
would need to drive.
We believe we can
get it down to three.
But because it's a
crowd-sourcing thing we're not
that concerned about it--
whether it's five or three.
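As a toy illustration of why a handful of drives can be enough: each pass reports a slightly noisy position for a landmark, and averaging even three to five passes already tightens the estimate. The half-meter noise figure here is invented, not a Mobileye number:

    import random, statistics

    def one_drive(true_pos_m, noise_m=0.5):
        # One car's noisy estimate of a landmark's position along the road.
        return true_pos_m + random.gauss(0, noise_m)

    true_pos_m = 100.0
    for drives in (1, 3, 5):
        estimate = statistics.mean(one_drive(true_pos_m) for _ in range(drives))
        print(f"{drives} drive(s): {estimate:.2f} m")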
AUDIENCE: Thank you.
