[MUSIC PLAYING]
ADRIAN MACNEIL: Good
morning, everyone.
Welcome and thanks
for joining us today.
My name is Adrian Macneil.
And I'm a director
of engineering
at Cruise Automation.
So at Cruise I lead our
engineering productivity team.
And so we're responsible
for building tools
and infrastructure that
supports the unique development
workflows of building
a self-driving car.
If you're not
familiar with Cruise,
we're a self-driving car
startup here in San Francisco.
So we're backed
by General Motors.
We have also raised around $5
billion in additional funding
from SoftBank and Honda.
And our mission is to launch
a fully driverless car.
So if you're a passenger
in one of our cars,
there's no need for you to be
responsible for the driving
at all.
You don't need to pay attention to the road.
You can sit back.
You can eat.
You can read a book, watch
a movie, get some work done,
or sleep if you want.
And our vehicles are powered
by a wide variety of sensors.
So we have cameras, LIDARs,
several different types
of radars, and also some
powerful onboard computers.
And our car needs to use all of these sensors to, first of all, make a 3D model of the world.
So it needs to understand
its environment.
The second thing
is it needs to make
a plan for how it
wants to navigate
through this environment.
And finally, it needs to turn
this plan into the steering,
braking, acceleration commands.
So this is the control stack.
We're currently testing
our vehicles here
on the streets of San Francisco.
And probably one of the best
perks about working for Cruise
is that we have this
mobile app that employees
can use to request rides
in self-driving cars
anywhere in the city.
And testing in this
complex urban environment
provides really interesting,
incredibly rich data
that our cars and our
engineers can learn from.
What really excites
me at Cruise is
getting to work on
tools and infrastructure
that support the unique
development workflows
of our self-driving cars.
So this actually is
an example of one
of our internal
tools called WebVis.
And it's a 3D web-based
visualization tool
that allows our engineers to
understand any of our driving
sensor data, see exactly
what the car was thinking,
or seeing, or doing
at any point in time.
There's different panels that
they can turn on and off,
and different
overlays to understand
the internals of the AV stack.
This talk is specifically
about our simulation at Cruise.
And simulation is incredibly
important for building
a self-driving car.
It's one of the most
important pillars.
And it's also a super
interesting challenge.
So first of all, I'm
going to talk a bit
about what I mean
by a simulation
for self-driving cars
and why that's important.
And then we're going to talk
about the infrastructure that
allows us to run all of
these simulations at scale.
And finally, I'm going
to talk about the data
pipeline and
analytics that allow
us to understand
all of this data
that we get from
our simulations.
So first of all, why is
simulation important?
As I mentioned, we
test out our cars
on the streets of San Francisco.
And it's a super
challenging environment.
We encounter pedestrians
and cyclists, motorbikes,
double-parked cars, emergency
vehicles, construction zones
far more frequently
than we would
if we were testing our cars in more suburban environments.
And I'm going to talk a bit more
about this photo in a minute.
But first of all, I want to
show you a couple of examples
of the types of
situations that we
run into on a daily basis,
driving in San Francisco.
So the way our cars are
able to learn so quickly
is that we do things like this.
We take them in San
Francisco into the middle
of a school zone at rush hour.
And this is the
kind of situation
that's difficult even
for a human to navigate.
So it gives us
incredibly useful data
about not only how our cars
behave in this situation,
but also this is data that we
can train from and learn from.
We run into construction
zones pretty frequently.
So our car needs to be able to
handle a sort of dynamically
changing city environment.
Double-parked cars are a
very frequent occurrence
in San Francisco.
Even moving double-parked cars, in this case. And I guess you can't quite see it here, but we're actually stopping for something as well.
And sometimes,
when we're getting
round double-parked
cars, we need
to come into oncoming traffic.
And we also need to understand
when people are yielding for us
and sort of expecting us to go.
And these are all things that we
got through fully autonomously.
Finally, when we're making
unprotected left turns,
we need to know how to
look for a gap in traffic
and how to clear
an intersection.
So in the early days of Cruise,
our development workflow mostly
consisted of taking
an example like this,
one of these scenarios, maybe one where we thought we could have done a better job.
One of our engineers
would download this
to their computer.
They would run it locally,
maybe make some code changes,
or introduce a new ML model.
And then they would run it
through the CI and unit tests.
And once they thought
that that was successful,
they would deploy
it on the road.
And we would do that to
gather additional data
about how that code is
performing in production.
But this is a pretty inefficient
workflow for two reasons.
First of all, it takes
a really long time
to get feedback
to our engineers.
Every time we want to know how our driving is improving, we need to go out, deploy it to a car, drive around, and just hope we run into more challenging situations like this.
The second thing is, I
guess, most importantly,
if an engineer is only
working on a single scenario
locally and then making
changes to that code,
they may have made that one
scenario significantly better,
but they don't know
if they've made three
other scenarios slightly worse.
So it's really important that our engineers can test their code across a wide variety of scenarios and understand how our overall behavior changes.
I'll give you one more example. This one starts with us making an unprotected left turn.
We've got pedestrians
that we need
to yield to and look for a gap.
And then we come across
a construction zone.
And we need to identify that
this construction worker here
is holding up the stop sign,
that he's expecting us to wait.
And then once he starts waving us on, the car needs to navigate through the scene.
And it also needs
to check that it's
safe to cross these
double yellow lines,
and check that we're not worried about that other guy running out in front of us. Why is he standing on the road?
And this is another
example of why simulation
is really important, because while we come across construction zones every day in San Francisco,
no two construction
zones are the same.
And it's really
important that we
can understand not only
how our car behaves
in scenarios that we
have seen, but also ones
that we haven't seen.
And so with simulation, we
can take something like this,
and we can ask questions like,
what would have happened if--
what if this human was standing
more further out into the road?
What if the cones were
in a different place?
What if the
construction equipment
was in a different place?
So we can take one of these scenarios and create potentially hundreds or thousands of variations on it.
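To give a flavor of how that fan-out works, here is a minimal Go sketch; the Scenario struct and its two parameters are hypothetical stand-ins for the much richer parameter space we actually vary.

```go
package main

import "fmt"

// Scenario is a hypothetical, heavily simplified stand-in for one recorded
// construction-zone scene; real scenarios have far more parameters.
type Scenario struct {
	WorkerOffsetM float64 // how far the worker stands out into the road
	ConeOffsetM   float64 // lateral placement of the cone line
}

func main() {
	var variations []Scenario
	// Sweep two parameters to fan one recorded scene out into many simulations.
	for w := 0.0; w <= 2.0; w += 0.25 {
		for c := -1.0; c <= 1.0; c += 0.25 {
			variations = append(variations, Scenario{WorkerOffsetM: w, ConeOffsetM: c})
		}
	}
	fmt.Printf("generated %d variations of one scenario\n", len(variations))
}
```

Each generated Scenario would then be dispatched as its own simulation job.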
So why is simulation important?
It allows us to validate
our driving behavior
across a large database
of different examples
before we go
driving on the road.
We get a faster feedback
cycle for our engineers.
And we can understand how
our behavior changes as we
vary one of those scenarios.
I'm going to talk a little
bit more about that later.
So this brings us back
to the intersection
with the cable car.
This photo was not actually
taken in San Francisco.
This is from our simulated
driving environment
known internally as the Matrix.
And just like the
Matrix, the car
has no idea that it's inside.
So we are simulating everything, including the camera and LIDAR inputs, using ray tracing.
And then the car goes
through its normal process
of doing perception,
planning, controls,
and responding with steering
and braking commands.
And testing in a simulated
environment like this
allows us to reliably reproduce
weird edge cases and scenarios
that we haven't seen
in the real world.
In a simulated
environment like this,
we can control everything;
pedestrians, cars, cable cars,
even other Cruise cars.
And so simulation
like this is one
of the most important pillars
of self-driving car development.
We've got a bit of work to do
on our pedestrians walking.
So this is just one type of simulation.
So this is what's known as an
end-to-end simulation, where
we're simulating everything
from our sensory input
through to the steering
and braking commands.
But there's many other
different types of simulations.
So another example
is a replay test,
where we're taking
actual sensor data
that we've recorded
from the real world,
playing that back
against the car stack,
and then comparing the results. For example, we can compare what we detected to human-labeled ground truth data.
And that allows us
to score how well
the car did at
turning that sensor
input into an accurate 3D
model of its environment.
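As a rough illustration of what scoring against ground truth can look like, here is a simplified Go sketch; the 2D types, the distance-based matching, and the tolerance are illustrative assumptions, not our actual perception metric.

```go
package main

import (
	"fmt"
	"math"
)

// Detection and Label are hypothetical, heavily simplified records; real
// perception output is 3D boxes with classes, velocities, and uncertainties.
type Detection struct{ X, Y float64 }
type Label struct{ X, Y float64 }

// score counts a detection as a true positive if it lands within tol meters
// of an unmatched human label, then reports precision and recall.
func score(dets []Detection, labels []Label, tol float64) (precision, recall float64) {
	matched := make([]bool, len(labels))
	tp := 0
	for _, d := range dets {
		for i, l := range labels {
			if !matched[i] && math.Hypot(d.X-l.X, d.Y-l.Y) <= tol {
				matched[i] = true
				tp++
				break
			}
		}
	}
	if len(dets) > 0 {
		precision = float64(tp) / float64(len(dets))
	}
	if len(labels) > 0 {
		recall = float64(tp) / float64(len(labels))
	}
	return precision, recall
}

func main() {
	dets := []Detection{{1, 1}, {5, 5}, {9, 0}}
	labels := []Label{{1.2, 0.9}, {5.1, 5.2}}
	p, r := score(dets, labels, 0.5)
	fmt.Printf("precision=%.2f recall=%.2f\n", p, r)
}
```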
Another example is a planning simulation.
So in a planning simulation,
we can construct an example
like an intersection here.
We have some pedestrians
entering a crosswalk.
The car is trying
to make a left turn.
And we can create an assertion
such as that the car must
yield to these pedestrians.
And the key difference with
this type of simulation
is that we're not simulating
sensor input here.
We're simply giving the car
a known state of the world,
and then we're
validating its behavior.
And this is much
cheaper to run if we
want to do hundreds or
thousands of variations.
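Here is a minimal sketch of what such an assertion could look like as a Go test; the State type and the runPlanningSim stub are hypothetical, not our actual simulation API.

```go
package planning

import "testing"

// State is a hypothetical per-tick snapshot emitted by a planning
// simulation; the real world state is far richer than two booleans.
type State struct {
	Time                  float64
	AVInCrosswalk         bool
	PedestrianInCrosswalk bool
}

// runPlanningSim is a stub standing in for the real simulator call.
func runPlanningSim() []State {
	return []State{{0.0, false, true}, {1.5, true, false}}
}

// TestYieldsToPedestrians sketches the kind of assertion described above:
// the car must never occupy the crosswalk while a pedestrian is in it.
func TestYieldsToPedestrians(t *testing.T) {
	for _, s := range runPlanningSim() {
		if s.AVInCrosswalk && s.PedestrianInCrosswalk {
			t.Fatalf("entered crosswalk at t=%.1fs while pedestrian present", s.Time)
		}
	}
}
```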
This is not an exhaustive list.
There's many other different
types of tests and simulations
we run.
So for example, we also do a lot
of performance testing, where
we're running our car stack
against real car compute
hardware.
And that way we can benchmark
things like latency, CPU,
memory, GPU use.
So we have a huge number of
these different scenarios.
For every one of these scenarios, we might want to create hundreds or potentially thousands of different variations.
Each of these tests that we run
requires large runtime assets.
So sometimes, you
might be using over
100 gigs of data that needs
to be copied onto that worker.
And that's things like binary
assets, map tiles, ML models,
recorded sensor data.
And finally, these tests produce really detailed metrics and data.
So scoring a car simulation is much more than just a pass/fail result, just like driving well on the road is much more than just not crashing or following the road rules.
So our engineers really need to
understand exactly how our car
behaves in that simulation.
It's critical that we pair the
simulation with on-road driving
for a couple of reasons.
First of all, we want to do
on-road driving so we can
find new and unique situations.
And the second one is
we need to validate
that our simulations accurately
reflect the types of things
that we're observing
in the real world.
So combining this on-road driving in complex urban environments with simulation maximizes the rate at which our cars can learn.
So that gives you a brief understanding
of what simulation means
and why it's important.
Next, I'm going to talk
about the infrastructure that
allows us to run these
simulations at scale.
Testing infrastructure at Cruise
is running at massive scale.
So every day we're running over
200,000 hours of simulations.
We do this on a cluster of
30,000 instances with 5,000
GPUs.
And you can think about this
as 30,000 virtual cars driving
around every day to
validate our cars' behavior.
And just to put that
number in perspective,
Cruise has around
180 test vehicles
registered with the
state of California.
So we're running over 100
times more cars in simulation
than we do in the real world.
We also produce nearly a third
of a petabyte a day of results
from all of these simulations.
If you asked me a year
ago, whether I thought
we were running
large-scale testing
infrastructure at Cruise, I
would have said, absolutely.
But the use of
simulation at Cruise
is rapidly expanding
as our engineers
find new and unique ways to
benefit from these simulations.
And so our test
infrastructure and the number
of hours of compute
that we've been running
has grown by over 25 times
in the past 12 months alone.
So you can imagine this provides
some pretty interesting scaling
challenges.
So we have really rapid
scheduling of jobs.
Sometimes, jobs
are changing state
over 1,000 times per second.
We need to deal with CPU and GPU resource availability, making sure we have quotas to support our demand.
We need to balance
our utilization
and make sure that we don't have
too many workers sitting idle
or that we don't
have jobs queueing.
We need to distribute these job inputs: 100 gigabytes of data needs to be moved to potentially thousands of machines quickly and efficiently.
And finally, we
need a data pipeline
that allows us to make
sense of all of this data
that we generate
from the simulations.
We choose to run
this in Google Cloud
because it allows us to
scale much faster than we can
on our own physical
infrastructure.
So we do have some
on-prem hardware
that we use for testing on
real car compute hardware.
As I mentioned, that's great
for performance benchmarking.
But the vast majority of our
compute runs in Google Cloud.
One of the other main advantages is that Google supports custom machine types for workers.
So we can choose exactly how
many GPUs, how many CPU cores,
and how much RAM we
want for an instance.
And this is really
valuable when we're running
different kinds of simulations.
We get access to the latest
generation GPUs and hardware.
And this is really beneficial
for the productivity
of our engineers.
And we're able to use the best
in class managed services.
So we take advantage of
Spinnaker, Pub/Sub, BigQuery,
and Google Cloud Storage.
Before we talk about our current architecture, it's helpful to think about where we've come from.
So in 2016, Cruise was
a pretty small company.
When it was acquired originally,
it was about 40 employees.
And we treated simulation
as fundamentally part of CI.
So we had a CI system that
launched VMs and containers.
And the CI system had a
homogeneous worker pool.
So we had to run all
of our builds and tests
on expensive GPU instances.
We were also limited to
16 times parallelization.
And that was due to some
performance bottlenecks
in the system.
So ultimately, this
resulted in engineers
spending a lot of time waiting
for simulations to run.
We solved this by introducing
a new tool called Hydra.
And Hydra is a job scheduler. It's called by CI or by users requesting a job.
It consists of a master, and
then various worker pools.
And it can send jobs
to these worker pools.
On the worker, we use
Docker to execute the job.
And on the GPU workers, we're
using Nvidia Docker to provide
access to that hardware.
The Hydra master writes all of its job state and simulation results into a Postgres database.
And then as often happens
at early stage companies,
we ended up with a whole bunch
of dashboards and reporting
tools, which were talking
directly to that database.
As our load increased
though, it became clear
that this architecture was
not going to scale very well.
So there were performance
bottlenecks, especially
in the database with
locking and triggers.
And it was clear that we
needed to revise this.
For anyone who's made changes to a production system before, though, you'll know that's much easier said than done.
So any API that
can be depended on
would probably be
depended on somewhere.
And we had things like
various reporting tools
that were talking directly
to these databases.
We had triggers that we
didn't want to interrupt.
We wanted to switch
over to the new system
without any downtime.
And we wanted to keep the
worker API backwards compatible.
So at the start of
this year, we migrated
to our new architecture.
And Hydra 2 is a collection
of individually scalable
components.
So we completely rewrote
the Hydra master in Go.
And that manages the scheduling.
And it talks to our workers via the exact same backwards-compatible API.
The scheduler has
its own database.
So it's using Cloud
Spanner to track
all of the current
in-progress job state.
And then from there, we
replicate the job execution
history, first of all, back
to that reporting database
that tools were depending on.
And then also through
Pub/Sub to BigQuery,
which is a much more
scalable data warehouse.
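As a sketch of what that replication path can look like, here is a hypothetical Go snippet publishing one job state transition to Pub/Sub; the project ID, topic name, and event schema are all illustrative assumptions.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

// JobEvent is a hypothetical shape for one job state transition; the real
// schema carries much more (worker, queue, artifact pointers, timings).
type JobEvent struct {
	JobID string    `json:"job_id"`
	State string    `json:"state"`
	At    time.Time `json:"at"`
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	topic := client.Topic("job-events") // hypothetical topic name
	b, _ := json.Marshal(JobEvent{JobID: "sim-123", State: "SUCCEEDED", At: time.Now()})

	// Publish is asynchronous; Get blocks until the message is acked.
	res := topic.Publish(ctx, &pubsub.Message{Data: b})
	if _, err := res.Get(ctx); err != nil {
		log.Fatal(err)
	}
	// A subscriber on the BigQuery side would drain this topic into the
	// warehouse table.
}
```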
Hydra 2 was built
to take advantage
of all of the building
blocks that we
have available to us in GCP.
So we're using things like
Cloud Load Balancing, Kubernetes
Engine, and Cloud Spanner
to run the master.
We're using Container
Registry and Cloud Storage
for storing our inputs.
We have output data
going into Cloud Storage.
And then we're also using
Pub/Sub and BigQuery
to manage our output data
and our execution logs.
I'm going to talk through
each of these components.
But we're going to start with
the most interesting one, which
is the worker pools.
So our worker pools are
made up of tens of thousands
of instances.
And our goal here is
basically to recreate
this, an entire
room of servers that
are running for our benefit,
from the comfort of our laptop.
We do this using Google Compute Engine, which is a great product.
It allows us to spin
up a VM on demand.
And as I mentioned, one
of the great features
about Compute Engine
is that we can
choose exactly what ratios of
GPU, CPU, and RAM that we want.
We're not limited to these kind
of predefined instance sizes.
And so Compute Engine has a feature called Managed Instance Groups.
And this is similar to
autoscaling groups in AWS.
And this allows you to
create a template instance.
And then you can
scale up or down
many, many hundreds of instances
of the exact same size.
Managed Instance Groups
come in two flavors.
So there are Regional Managed Instance Groups, and these are tied to a specific region, such as us-west1. And they automatically scale up instances and balance them across all of the zones.
So in this example, we
have US-West as our region.
There's an instance group in it.
And then we're creating
all of the instances
in the three zones in US-West.
Unfortunately, Regional Managed Instance Groups are very aggressive at keeping the number of instances balanced across the different zones.
So this is primarily designed
for high availability
workloads.
If you're serving
customer traffic,
you want to have spare
capacity and make sure
that if requests are coming in,
you don't end up dropping one
just because a zone went down.
But for batch workloads, we
don't actually care about that.
We do care, though, that in this example, if we scale down a couple of instances in us-west1-a, it doesn't automatically start killing random instances in us-west1-c and starting up new ones in us-west1-a just to maintain that perfect balance.
We might be halfway through a 20- or 30-minute compute job, for example, and we really don't want it to start killing random instances that are halfway through those jobs just so that we can have perfect balance.
So the other option
that you have available
are the Zonal Managed
Instance Groups.
And a Zonal Managed Instance Group scales independently within a single zone.
So we can create one
instance group for each zone,
and we don't need to
worry about perfectly
balancing these workers.
Another challenge
that we have here
is that these
instance groups are
limited to running 1,000
instances per group.
And if you talk to Google, it's possible to increase this limit a little bit, but not enough to run the tens of thousands of instances that we want.
So the way we solved this is
by implementing Sharded Managed
Instance Groups.
Basically, in each zone, we
would create more than one
Managed Instance Group.
And that allows us to
scale above 1,000 workers
in a single zone.
This does make autoscaling
a bit more challenging.
I'm going to talk a bit more
about autoscaling in a second.
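Here is a minimal Go sketch of what resizing a sharded group can look like using the Compute Engine API; the project, zone, and group naming convention are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

// resizeSharded spreads a zone-level target across N shard groups, since a
// single Managed Instance Group caps out at 1,000 instances. Group names
// here (gpu-worker-us-west1-a-0, -1, ...) are hypothetical.
func resizeSharded(ctx context.Context, project, zone, prefix string, shards int, target int64) error {
	svc, err := compute.NewService(ctx)
	if err != nil {
		return err
	}
	base := target / int64(shards)
	rem := target % int64(shards)
	for i := 0; i < shards; i++ {
		size := base
		if int64(i) < rem { // the first `rem` shards absorb the remainder
			size++
		}
		name := fmt.Sprintf("%s-%d", prefix, i)
		if _, err := svc.InstanceGroupManagers.Resize(project, zone, name, size).Do(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := resizeSharded(context.Background(), "my-project", "us-west1-a",
		"gpu-worker-us-west1-a", 3, 2500); err != nil {
		log.Fatal(err)
	}
}
```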
Each of these Managed
Instance Groups
creates an instance
from a single template.
So if you want to run
different types of workers,
like CPU and GPU
workers, you need
to create more than
one instance group.
At Cruise, we also have different instance sizes for each of these.
So on our CPU workers, some
jobs only need a few cores,
and some jobs can take
advantage of 96-core machines.
Likewise, some of our GPU
jobs only need one GPU.
Others can take advantage
of multiple GPUs.
And we obviously want
to optimize the cost
by not running workloads
on machines that are
overprovisioned for that job.
So bringing this
back to our diagram,
we now have multiple
availability zones
in the region.
Each of the availability zones
has multiple instance types.
And then each of
these instance types
has multiple managed
instance groups within it.
So at this point, we're starting
to get quite a few managed
instance groups up here.
But there's one more thing
that we need to consider.
And the final piece of the puzzle is which region to select for the workload. This slide is actually a map of the world, believe it or not.
And if you're
thinking about where
to run your compute workloads,
there are several factors
that you need to consider.
You obviously need to think about legal considerations, things like data privacy and where your data is being stored and processed, and latency requirements, if your jobs need to talk to each other or to data in a different region.
You want to co-locate your
data and your compute.
So you don't want all of
your data stored in US-West
and then your compute
happening in US-East.
And finally, not all
regions are created equal.
So if you're doing things
like this, running thousands
of GPUs, you really want to
talk with your TAM or solutions
architect and get a sense of
which region is going to best
support those workloads.
In our case, the answer
was multiple regions.
So generally, when people are
running in multiple regions,
they're doing this, again,
for high availability reasons.
But in our case, we care more
about having the flexibility
that we can run our
workloads in whichever region
is easiest to get capacity.
So we have pretty spiky
workloads during the day.
Sometimes, an engineer
will kick off a job,
and we might need
thousands of more servers.
And being able to create
that in any region
gives us great flexibility
to be able to take advantage
of quotas in multiple regions.
Running in multiple regions,
of course, isn't free.
So there's additional
cost and complexity
with setting this up.
You need to think about latency.
For example, if you are running
machine learning training,
you would not want
your workers talking
across regions, because
that's going to be
quite a lot of extra latency.
But for simulation
workloads, they're
all running pretty
independently.
So that wasn't a problem for us.
You do need to think
carefully about where
your data is located.
As I mentioned, you don't
want your data stored in West
and your compute
happening in East.
So Google Cloud Storage
has this great feature
called multi-regional buckets.
And when you create a
multi-regional bucket,
you choose a
continent, like the US,
and then it
automatically replicates
your data across the different
regions in that continent.
So when you use a
multi-regional bucket,
you pay slightly
more for the storage,
but then it makes transfer
within all of those regions
free and fast.
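Creating one is a one-liner with the Cloud Storage Go client; a minimal sketch, with hypothetical bucket and project names.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Location "US" makes this a multi-regional bucket: objects are
	// replicated across US regions, so reads from any US region are fast
	// and free of cross-region transfer. Names are hypothetical.
	err = client.Bucket("sim-assets-example").Create(ctx, "my-project", &storage.BucketAttrs{
		Location: "US",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```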
So bringing this back
to our little diagram,
we have at the top some
multi-regional buckets.
So we're using Cloud Storage
and Container Registry
for storing our data.
We then have multiple regions.
Each one of these regions has
multiple availability zones.
Each one of these
availability zones
has multiple worker types.
And each one of
these worker types
has multiple instance
groups associated with it.
So we really end up
with a huge number.
You can see how
it's easy to end up
with dozens of Managed
Instance Groups doing this.
To manage all of this, we're
huge fans of Terraform.
So Terraform allows
us to easily define
the infrastructure as code.
And we can easily make updates
and roll out changes to this.
And Terraform modules
are a really great way
to build something like
this, because we can set up a template for what a particular instance group should look like.
And then you can
use that to compose
what a zone should look like.
And you can use that to compose
what a region should look like.
And you can sort of
template and parameterize
each one of those modules.
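A rough sketch of that composition in Terraform, assuming a hypothetical worker_shard module; every name and value here is illustrative, not our actual configuration.

```hcl
# Hypothetical module instantiation: one worker type in one zone, sharded
# across several Managed Instance Groups to exceed the 1,000-instance cap.
module "gpu_workers_us_west1_a" {
  source = "./modules/worker_shard"

  zone          = "us-west1-a"
  machine_type  = "custom-16-65536"   # custom CPU/RAM ratio (16 vCPUs, 64 GB)
  accelerator   = "nvidia-tesla-t4"   # illustrative GPU choice
  shard_count   = 4                   # number of MIGs for this worker type
  max_per_shard = 1000
}
```

A zone module can then instantiate one of these per worker type, and a region module one zone module per zone.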
When we run this
many instance groups,
we need to think carefully about
how and where we are scaling.
So a typical workday
at Cruise starts
with engineers submitting
jobs in the early morning.
It ramps up during the day.
You'll notice there's a bit
of a lull there over the lunch
period.
And then it tapers off
a bit towards midnight.
And we have some low-priority
jobs which can run overnight.
But the majority
of our engineers
are getting their work
done during the day.
And we don't want engineers waiting on machines if we can at all avoid it.
So scaling like this is always
a balance between two things.
You have queue times
and utilization.
On the one hand, if we were
to scale up too quickly
or if we were to
scale down too slowly,
then we would end up
with excess capacity.
And this would mean our
utilization would be low.
We'd be paying for servers
that we don't need.
On the other hand, if we
were to scale up too slowly
or we scaled down
too quickly, then we
end up with jobs queuing.
And that's not ideal
either, because now we're
paying for engineers while
we're waiting for these jobs
to finish.
So we need to be able to
handle rapid changes in demand,
but we also need to
ensure that we're closely
matching this demand
curve, and that we're not
over or under provisioning
the number of servers we need.
We have some unique aspects
to managing this autoscaling
at Cruise.
So we need to balance the
scaling across multiple regions
and zones, and instance groups.
In some cases, we prefer scaling in one region, and only start scaling up in another region if we can't get capacity in the first.
We have different queues
with different priorities.
So if we have a
low-priority job,
we might not want
to scale up at all.
We're just happy for it to
kind of sit there and queue.
And we also need to avoid
interrupting long-running tasks
when we're scaling down.
And the default autoscaler that comes with GCP can target something like a Stackdriver metric.
So you can target
something like CPU use,
or you can target your
own custom metric,
like queue length.
But unfortunately, when the
default autoscaler scales down,
it always picks a random
instance to terminate.
And when it does that, it only gives you 60 seconds' notice that the instance is going to be shut down.
As I mentioned earlier,
if we're halfway
through a 20 or
30-minute simulation job,
we really don't want
to kill that instance
just because it got unlucky.
So to meet these needs, we
built our own autoscaler.
And I highly recommend this if the stock autoscaler is not working for you.
It's actually pretty easy
to build with just a few GCP
components.
So we have here a
Cloud Scheduler,
which kicks off a cloud
function every minute or so.
And that cloud function is
just a few hundred lines
of JavaScript.
So the cloud function is
doing a couple of things.
It reaches out to Stackdriver,
and it asks, first of all,
for the demand.
So how many jobs do we
have running in progress?
And how many jobs do we
currently have queued?
And then it looks at the
supply side of things.
How many workers do we
currently have connected
or that are
currently scaling up?
If it decides that we need to
scale up additional capacity,
it can do this really
simply by just reaching out
to the Instance Group
Manager API there,
and it can just say,
please set the target size
for this instance group to x.
If it wants to scale down, it
puts a message on a Pub/Sub
queue, and then whenever
a worker is idle,
it simply checks
this queue to see
if there's a shutdown message
for that particular worker
class.
And the unlucky worker that
drew the short straw just
deregisters itself from
the Instance Group Manager,
and then it
terminates gracefully.
So this architecture
allows us to ensure
that we're only ever terminating
workers that are idle.
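To make that loop concrete: the real function is a few hundred lines of JavaScript, but here is a compressed sketch of one tick of the decision in Go; the metric reads are stubbed out as parameters, and the project, zone, group, and topic names are hypothetical.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	compute "google.golang.org/api/compute/v1"
)

func autoscaleTick(ctx context.Context, queuedJobs, busyWorkers, idleWorkers, currentSize int64) error {
	demand := queuedJobs + busyWorkers // from Stackdriver, stubbed here
	supply := currentSize              // workers connected or scaling up

	switch {
	case demand > supply:
		// Scale up: just ask the Instance Group Manager for a bigger target.
		svc, err := compute.NewService(ctx)
		if err != nil {
			return err
		}
		_, err = svc.InstanceGroupManagers.Resize(
			"my-project", "us-west1-a", "gpu-worker-us-west1-a-0", demand).Do()
		return err
	case demand < supply && idleWorkers > 0:
		// Scale down: never pick a victim ourselves. Post shutdown tokens
		// and let idle workers claim them, so in-flight jobs are never killed.
		psc, err := pubsub.NewClient(ctx, "my-project")
		if err != nil {
			return err
		}
		defer psc.Close()
		for i := int64(0); i < min(supply-demand, idleWorkers); i++ {
			res := psc.Topic("worker-shutdown").Publish(ctx, &pubsub.Message{
				Attributes: map[string]string{"worker_class": "gpu"},
			})
			if _, err := res.Get(ctx); err != nil {
				return err
			}
		}
	}
	return nil
}

func min(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	if err := autoscaleTick(context.Background(), 120, 900, 40, 1000); err != nil {
		log.Fatal(err)
	}
}
```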
Another thing that
we needed to consider
when running this many
instances is the networking.
So we run all of our instances in a VPC, using private IP addresses.
And when you're using
private IPs, that
means that if you need to
reach the outside internet,
then you need to
go through a NAT.
Google has a product called
Cloud NAT, which went GA back
in November.
But when we set this up,
Cloud NAT was not available.
So we actually had to manually
create a whole bunch of NATs.
And that meant we
needed to create
a NAT in every
availability zone,
basically, in every region.
And some availability zones
needed more than one NAT
to handle the bandwidth
throughput that we needed.
And this works, but it's
still infrastructure
that we don't want to manage.
So we're currently
in the process
of migrating to Cloud NAT.
And Cloud NAT can automatically
scale horizontally
to support your demand.
Also, if you're doing any communication with managed Google services, you have to enable Private Google Access.
So Private Google Access allows your machines running on private IP addresses to route directly to services like BigQuery, GCS, or Spanner without needing to go through your NAT. So it significantly reduces the bottleneck on your NAT.
Private Google Access is actually pretty complicated to set up if you're managing your own NAT instances.
But if you're using
Cloud NAT, it's
a checkbox when you
set up your VPC.
So check that box, and you
will live a happy life.
The final piece of the picture here is storage.
So as I mentioned, we're
using Google Cloud Storage
for storing inputs
and output data,
and we're using Google
Container Registry to store
our runtime container images.
In Google Cloud
Storage, we have things
like application binaries, map
tiles, machine learning models,
and sensor data,
all of these things
that we need to
run a simulation.
We've gotten pretty amazing
throughput from GCS.
So this is our write throughput.
On a typical day, we might be peaking at around a gigabyte per second.
But we don't write nearly
as much data as we read.
So this is our read throughput.
And you can see on
a typical day we're
peaking at over 380
gigabytes per second of data.
So that's over 3 terabits per second.
And this is an example of where
the cloud is super powerful.
Because if we set out to
build infrastructure like this
ourselves, it
would be incredibly
challenging to manage
reliably at that scale.
And it would also not be
value add work for Cruise.
Instead, we simply click
a button on the console
or write a few
lines of Terraform,
and we have this kind of
capacity available to us
instantly.
We have a bunch of work coming
down the pipeline for Hydra.
So we're always making
performance improvements
to handle the increasing load.
We're looking at task pipelining
so that we can prefetch assets
for the next job that we're
about to run before we finish
the current job.
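A minimal sketch of that pipelining idea in Go, with stand-in fetch and run steps: while the current job simulates, a goroutine overlaps the download of the next job's assets.

```go
package main

import (
	"fmt"
	"time"
)

// fetchAssets and runJob are hypothetical stand-ins for pulling 100 GB of
// binaries, maps, and models onto the worker, and running a simulation.
func fetchAssets(job string) {
	time.Sleep(100 * time.Millisecond) // pretend to download assets
	fmt.Println("assets ready for", job)
}

func runJob(job string) {
	time.Sleep(200 * time.Millisecond) // pretend to simulate
	fmt.Println("finished", job)
}

func main() {
	jobs := []string{"sim-1", "sim-2", "sim-3"}
	fetchAssets(jobs[0])
	for i, job := range jobs {
		done := make(chan struct{})
		if i+1 < len(jobs) {
			go func(next string) { // overlap next download with current run
				fetchAssets(next)
				close(done)
			}(jobs[i+1])
		} else {
			close(done)
		}
		runJob(job)
		<-done // next job's assets are on disk before we start it
	}
}
```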
We're optimizing
where we send jobs
based on which workers
already have resources cached.
And we're A/B testing worker updates so that we can make sure workers are operating successfully, and that we haven't introduced any regressions in our worker infrastructure, before it starts handling production load.
So the last thing I'm
going to talk about
is our data analytics pipeline.
If you recall from
this previous slide,
I mentioned that we're storing
our output data in Google Cloud
Storage.
And then we ingest
this into BigQuery.
And like with
Hydra, I think it's
helpful to touch briefly on
the architecture of where
we came from before I talk
about how we're using BigQuery.
So in this diagram, you can see
our simulation JobScheduler.
That's Hydra.
And Hydra is pulling
work from build artifacts
and from other simulation
data that we need.
And then the output
data that it produces
goes into our Graph
Compute Engine.
And the Graph Compute Engine allows engineers to run post-processing tasks. For example, they might want to compare that output data against human-labeled ground truth data and create aggregations on it to score how well we did in that simulation.
The post-processed data was then stored as Avro files in S3.
And we chose Avro here
for a couple of reasons.
First of all, it was just
really easy to work with.
Hydra already had the ability
to save Avro files into S3.
And secondly, we didn't
need to worry about schemas.
So our engineers could make
changes to the AV code,
or they could make changes to
their post-processing steps.
And that data would
get written directly
to Avro files,
which they were then
able to access without
needing to go through any team
or get approval to
change the schemas.
We had tools like Jupyter Notebooks, and also custom analysis tooling, which could read these Avro files directly and render them into all sorts of dashboards.
But this architecture
became pretty
limiting once we started wanting
to run more advanced analysis.
Without a proper
query engine here, we
couldn't answer advanced queries
without going back and editing
the post-processing code.
It also created pretty significant memory use: our front-end visualization tools needed to read these Avro files, and that did not scale well as the number of simulations increased.
Finally, it's pretty
time consuming
building this type of tooling
if you're tying it specifically
to the data format
that you're using.
There are many
different ways we could
have gone about
solving this, but given
that we were using
Google Cloud, we
decided to use BigQuery
as the data warehouse.
So we have an ETL pipeline here that's a simple ingestion service. It takes the data produced by our post-processing Graph Compute Engine and ingests it into BigQuery.
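A minimal sketch of what that ingestion step can look like with the BigQuery Go client, assuming the post-processing output is written as Avro files (as in the earlier pipeline) and using hypothetical bucket, dataset, and table names.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://sim-results-example/run-123/*.avro")
	gcsRef.SourceFormat = bigquery.Avro // schema is inferred from the Avro files

	loader := client.Dataset("simulation").Table("results").LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteAppend

	job, err := loader.Run(ctx)
	if err != nil {
		log.Fatal(err)
	}
	status, err := job.Wait(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if err := status.Err(); err != nil {
		log.Fatal(err)
	}
}
```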
We've had great success
with BigQuery so far.
So we're ingesting
over 10 terabytes
a day of data with no
administration overhead to us.
We can get this
data into BigQuery
within minutes of
finishing a simulation.
And we can query
it within seconds.
And the BigQuery SQL interface
is accessible and supported
by many different
reporting tools, not just
our in-house ones.
Our engineers can access
this data via the BigQuery
console and UI.
We also have things like
Jupyter Notebooks and BI
tools, like Looker and
Tableau, that can access it.
And we also have our own custom
front end analysis tooling.
So the last thing I
want to talk about today
is a bit about this custom
analysis tooling that we built.
I'll give you an insight into
the world of self-driving car
development.
So I'm going to walk
through a complete example
of a simulation
and how we run that
and how we understand
the results.
So bear with me here.
We have a car trying to make
an unprotected left turn.
And the car needs to
yield to these cars
in the oncoming lane of traffic.
So this is represented in pink.
We also have what we call a
region of interest, which is
the dotted line that you see.
And our car needs
to clear this region
of interest between that gap in
traffic between those two cars.
What we're interested in, in this particular example, is which gap our car chose to go through and what that gap size was. The gap size is the period of time between when car one, the blue car, left the region of interest and when car two, the green car, entered it.
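To pin that definition down, here it is as a trivial Go function, with hypothetical timestamps.

```go
package main

import (
	"fmt"
	"time"
)

// gapSize implements the definition above: the time between the lead car
// leaving the region of interest and the trailing car entering it.
func gapSize(car1Exit, car2Entry time.Time) time.Duration {
	return car2Entry.Sub(car1Exit)
}

func main() {
	exit := time.Date(2019, 4, 10, 12, 0, 3, 0, time.UTC)         // hypothetical
	entry := time.Date(2019, 4, 10, 12, 0, 7, 500e6, time.UTC)    // hypothetical
	fmt.Println("gap:", gapSize(exit, entry))                     // gap: 4.5s
}
```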
So if you're still with
me, keep that in mind.
We'll move on to the next slide.
This is an example of some
of the custom visualization
tooling that we
built. So you remember
we're looking at the gap
size between the two cars
as we're making an unprotected left turn.
And in this particular example,
we're comparing two factors.
We're looking at how the
speed of those oncoming
cars and the distance
between those oncoming cars
affects the gap that we chose
and the size of that gap.
And you can see here that
we're comparing a base
branch and a feature branch,
and then in the third box,
you can see the delta between
the base and feature branch.
So engineers can easily identify
with this particular commit
how that impacted
which gap we chose
as a function of these
two input parameters.
Each one of these green squares here is actually a separate, individual simulation running on Hydra. And each is generating probably a gigabyte of output data.
So you can see how we end
up with these large scale
compute and data
storage challenges
once we start running
these types of simulations.
Another thing about this tool is
that the engineers can actually
click on one of these
particular squares,
and it gives them
more information
about that particular run.
So we have a lot of ongoing work in simulation tooling like this. We work closely with our AV engineering teams to understand what types of experiments they want to run and how we can best understand and visualize those experiments.
So thank you for joining today.
Before I go, I want to leave
you with three thoughts.
So the first one: simulation for self-driving cars is a super interesting problem. It's a privilege to get to work on it. At Cruise, we're using the latest tools and infrastructure to test these cars at massive scale.
So this includes things
like Google Cloud,
and also in-house
developed software.
And finally, it takes a
diverse range of skill
sets to build self-driving cars.
So we have not just machine learning engineers and robotics engineers, but back-end engineers, infrastructure engineers, 3D front-end engineers, data engineers, and data scientists.
And we're hiring in San
Francisco and Seattle
if you're interested in working
on these types of problems
every day.
So thank you.
[MUSIC PLAYING]
