[MUSIC PLAYING]
NICOLAS GEOFFRAY: Well,
thanks for being here.
I'm impressed that there are people showing up.
The Android Fireside
Chat was kind
of scaring me that everyone
would just avoid the session.
But no, it's great
to have you here,
and this talk is about ART.
So there's a part of this talk that was supposed to be on garbage collection, which my colleague David was planning on giving, but we're not going to rehash the same thing. So I'll put long pauses-- awkward ones, I hope-- during the talk, so I can fill the 40 minutes. So bear with me.
So given we're celebrating
10 years of Android, David
and I thought it
would be a good idea
to think about what we've
done the last 10 years
and how the Android
Runtime, which
is the thing we worked on for
a couple of years now, evolved.
So here we are.
So some of you went to the chat [INAUDIBLE], so I guess you already know what an Android Runtime is in the Android stack.
But in case you don't,
it's that little layer,
that yellow one here, between
the Android framework,
like the Android
operating system,
and the actual
underlying kernel.
The Runtime runs both the
Android framework and all
of the apps, so everything
written in Java,
that's what we execute.
And so, being so core in the platform, it becomes responsible for a ton of things. The user experience could be very bad if the Android Runtime were not efficient, and you saw that this morning with how the GC was sort of poor in the dark days.
So in this talk, I'll show you how the runtime versions-- the iterations we've made over the years-- have each asked: OK, what do we need to improve this year? And like I said, ART, the Android Runtime, is responsible for a bunch of things, and raw performance is one clear one, like how fast you execute Java code.
But clearly, it's also responsible for jank, like the 16-millisecond window. If the runtime is not able to execute the app's Java code in time, well then, we'll miss the frame and produce a lot of jank.
Application startup--
there's a lot of Java code
that needs to be executed
during application startup.
Again, if the runtime is
slow, startup will be slow.
Boot times-- the Android
OS is written in Java,
so a lot of code, again,
executes during boot.
Battery-- if we're slow, we're
going to tank your battery.
And install time is also something we care about, because when we get an APK, the platform will optimize it, and that could take a long time, depending on how we implement it. And we don't want that, because we want you to use the app right away.
And the other two are memory related: disk space, so how much space the runtime takes for its own optimizations, and then RAM.
So Java being Java,
there's allocation
that the runtime
needs to handle,
and if it doesn't do it well,
then it can take a lot of RAM.
So essentially, there's
been three incarnations
of the Android Runtime.
The first one was Dalvik.
It was the first implementation
that shipped with Android.
And Dalvik's purpose,
or Dalvik's main focus,
was how do we save RAM, and the
reason being, back in the day,
like 10 years ago, the RAM we had on the phones we were shipping was even less than 200 megabytes.
And that was very little if you want to execute the whole Android stack.
So everything Dalvik
was doing was about, OK,
how do we save on RAM?
So it could not generate any native code-- JIT or AOT is how we generate code. It could just interpret the dex code, the dex code being the thing that gets sent to Android for execution of your app.
Eventually, it got a
just-in-time compiler,
so that we could generate
native code of the dex code.
But again, it was very
limited to what it could do,
because RAM was the main focus.
And its GC was tailored around apps not allocating objects.
If you've been to the talk this
morning, things have changed.
But back in the days,
the recommendation
was like, please
avoid allocations.
And this worked well
for, I think, five years,
until KitKat.
But there was a point where
Dalvik could not keep up.
Phones were getting bigger.
Phones were getting more
performance, more RAM.
That was 2013, '14.
I think it was 1
gig, 2 gigs of RAM.
And apps were also
getting bigger.
So initially, apps were supposed
to be like this small layer
between UI and the framework.
But apps started doing more and more things.
So that 16-millisecond window I talked about for rendering a frame-- well, more things started to be executed there.
So we had to do
something about it.
And the answer
happened in Lollipop,
with ART, which introduced
ahead-of-time compilation.
So no more interpretation,
or very, very little.
And most of the things were
ahead-of-time compiled,
meaning we were executing
native code for your app,
and that is probably 20x
faster than interpretation.
We also introduced a state-of-the-art GC-- what you find in regular runtimes. It's precise, which means we're not going to be confused by an integer that looks like an object. But it's also generational, so that the GC pauses we need to do on the UI thread will be very short.
So pauses don't actually
end up creating jank.
The third incarnation is
like an evolution of ART.
It happened in two
releases, Android Nougat
and Android Oreo.
In Android Nougat, we introduced
profile-guided compilation.
I'll talk about this later, or
explain a bit later what it is.
But it drastically
helped on scaling ART's
ahead-of-time
technology to be more
optimized for the platform.
The profile-guided compilation
has underneath-- the way
it works is like it's a hybrid
just-in-time, ahead-of-time
compiler.
So we're trying to use
the best of both worlds
to optimize the platform.
And in O, after we'd done
all the optimizations in N--
in O, we focused on
the garbage collector,
and implemented a brand new
one, which makes the pause even
shorter on the UI thread.
We call this concurrent GC.
Now the GC happens on
a different thread,
so it's not affecting
the execution of the app.
All right.
So before I dive
into ART details,
I wanted to show this to you--
the state of Android
distribution today.
And in case you're still optimizing for Dalvik, or if you need to care about Dalvik and this annoying GC_FOR_ALLOC-- if you were at Chad's talk, you know what I'm talking about-- well, there's still this 10% here: [INAUDIBLE] KitKat, Jelly Bean, and a few others.
So around 10% of devices
are still running KitKat.
So my recommendation
is, it still matters.
10% is probably like 200 million users. It's quite a big number.
So it still matters.
But give it a couple
of years, and hopefully
in two years, that will be gone.
And Dalvik can be
part of this museum.
So things ART matters for--
I've put eight boxes.
They look nice.
And we do matter a lot for this.
Like, if we do get
it wrong, things
will go bad on your device.
Raw performance, I talked about.
That's Java execution.
Jank, application startup,
battery, disk space, RAM,
boot times, install times--
I'm just repeating myself,
but this is really important.
This is the thing that makes
your user experience kind of OK
so that you can enjoy the apps.
I'm going to go over the releases I talked about-- the different incarnations of the Android Runtime-- to show what makes ART today. Because ART, like I said, has had a lot of evolution.
But we also took good
things from Dalvik.
I'm listing the major ones
here, because the list
would be too long.
And obviously, the major thing
that the Dalvik architecture
brought was RAM savings.
And for that, Dalvik,
or the Android platform,
actually, introduced
the Zygote, which
is the parent
process that creates
all of the other processes.
So because it's
the parent process,
you have the option
of that parent process
starting up, or
allocating, a lot of memory
that apps can use.
And that memory can be
shared with the other apps.
And that's super important.
That means that every app now doesn't need to allocate this memory that it would otherwise need to actually execute in ART. Today that's around a couple dozen megabytes that we save per app-- memory that the Zygote just allocates once and shares with the other apps.
Then Lollipop-- that was the
major shift when we introduced
ahead-of-time compilation.
Ahead-of-time compilation happens with what we call an SSA compiler-- a Static Single Assignment compiler. That's a compiler buzzword for a state-of-the-art compiler design that does a lot of optimizations and makes your code up to 20x faster.
So we introduced the
ahead-of-time compiler.
That helped a lot
on reducing jank,
because now the
code was compiled,
not needing to be
interpreted, and very fast.
Reducing application
startup-- same argument.
But also saving battery. With the execution being 20x faster-- the point isn't that we save 20x on battery, but things get faster, and we don't need to spend as much time on the CPU anymore.
We also save on boot times.
The whole Android OS is
ahead-of-time compiled,
and doesn't need to be
interpreted at boot.
So here we go.
Things go faster at boot.
We also introduced a
new GC, generational GC,
which reduced the pauses
and removed the need
for GC_FOR_ALLOC in Dalvik.
Then the third incarnation,
Nougat and Oreo.
I mentioned how
there, we introduced
profile-guided compilation.
And that thing helps--
it's kind of the mother of
all the optimizations today
that we do.
It helps a lot of these metrics. It helps on jank-- less code gets compiled, but the things that we care about get optimized, so the UI thread needs to run less code.
It helps on application startup.
Because we can profile
the application,
we're able to know what
matters at startup,
so that when we
recompile the app,
we recompile it
with optimizations
that optimize startup.
It helps on battery--
again, we're saving
on the amount
of things we're interpreting.
It helps on disk
space, because instead
of compiling the
entire app, which
was what Lollipop
was doing, now we're
only compiling the hot parts of an app,
and that's probably like
10% to 20% of the dex code.
So 80% just doesn't
get compiled.
And that's a lot of savings.
It saves on RAM-- having a concurrent GC means we can do a lot more defragmentation of the heaps of every app, so we get back the memory we used to lose to fragmentation with the previous GC.
Profile-guided compilation also
helped a lot on boot times.
Remember the "optimizing apps" dialog? Well, that's the reason we were able to remove it. We no longer needed to AOT compile all of the apps at boot to make sure the device performed reasonably. Instead, when we take an OTA, we don't compile at boot at all.
We're going to JIT when the user wants it, and then eventually, we're going to do profile-guided compilation of the apps when the user is not using their phone.
And then finally, it helped on install times, because instead of waiting for the compiler to compile the entire app when you install, now we don't compile at all. We just rely on the JIT the first time the app is used.
And lastly, I want to mention Pie, because we developed Pie at around the same time as Android Go. And Android Go was a great effort in the Android platform. And for that, the work we did was mostly to save on disk space and RAM, because Android Go is like 512 megabytes to a gigabyte of memory and 4 or 8 gigs of disk space.
So most of our
efforts were focused
on improving RAM and also
improving disk space.
So in that release, we introduced compact dex, which is a compact version of the dex format. That saves on RAM: the less dex code you need to put into memory, the more you're saving, obviously.
Also, when the APK has the dex stored uncompressed, we will not extract a copy of it onto disk. Before, we used to extract it to do optimizations on the dex file-- which we cannot do inside the APK, because the APK is signed. So before Pie, we'd extract it, do some optimizations, and rely on that for the first few runs before we do profile-guided compilation.
Now we give the option to the developer: if you want to save on disk space, put the dex file uncompressed in the APK, which means we're not going to extract it on device. We'll have just one version of the dex code, rather than a compressed version in the APK and an uncompressed one on disk.
That was a lot of optimizations. I wanted to focus on one, which is raw execution performance, because what you saw this morning was pretty cool with [INAUDIBLE], but this is even cooler.
So obviously, the
faster ART runs,
the more we're
saving on battery,
on application startup,
and making the UI smooth.
So it really matters, all
the optimizations we do.
And over the
releases, we've kept
on improving the performance by
looking at actual applications.
In this case, it's the Google Sheets app.
And every release
we worked on, like,
OK-- how do we improve
the Google Sheets app?
And the Google
Sheets team helped
us build benchmarks that
show how long it takes
to do sheets manipulation.
Here, higher is better.
Blue is Dalvik.
And that's a score of one.
So we make the performance relative to Dalvik.
Red is Lollipop-- so that's
when we introduced ART.
And then yellow is today.
And you can see that
we went from around
like a 4x improvement when
we moved to ART and Lollipop
to like an average of
10x today, and even
to 26x on one benchmark.
So we're pretty happy
with those numbers.
But we didn't just look at Sheets.
We tried to also look at
what happened to other apps.
So a couple of
years ago, we also
worked with the Chrome
team and the YouTube team
to look at what they
think we should optimize.
And there again, after the fact, even though we were not focused on optimizing those benchmarks, we saw that we had these 4x to 6x improvements from what we've done.
Here are two examples. There's the [INAUDIBLE] benchmarks-- that's the JavaScript benchmark suite that we ported for our purposes. That's DeltaBlue and Richards, and it's, again, 2x to 4x, 3.5x, for those benchmarks, up to 6x in Pie.
And then ExoPlayer-- that's the audio and video player driving the YouTube app on Android. Again, around 2x at the introduction of ART, and then 4x today in Pie.
And while I have your attention on performance, I have a shameless ask to make.
We are always super
interested in improving code
that you think is important.
So if on your side,
you'd like us to show off
how we improve
performance of your app,
please come talk to us.
There's the office hours from
1:00 to 6:00 this afternoon.
And we would be really
interested in knowing
what you think we should
care about for performance.
And then we can
show that off here.
So the question then is, how did we get this level of improvement? I mentioned that ART now has a modern compiler implementation-- I called that SSA. And thanks to that modern SSA compilation, there's a bunch of optimizations we're able to do now.
If you know compilers, these could look familiar-- inlining, dead code elimination. I'm not going to go over all of them. Lucky you.
Instead, I'll focus on an example that shows how the optimizations matter, especially for a language like Kotlin, which adds a lot more abstraction to help the productivity of the developer, but makes it more challenging for the runtime to optimize.
So let's take this simple method. Very simple. It's a function that takes one string argument and returns its length.
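It looks something like this-- a sketch; the exact code on the slide may have differed:

    fun length(s: String): Int {
        // The parameter is non-null as far as Kotlin is concerned,
        // but callers coming from Java still need a runtime check.
        return s.length
    }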
When we run that
through our dexer--
R8, our awesome new dexer--
here's the code you get.
Again, pretty straightforward, even if you're not familiar with dex code. You're creating a string-- then, Kotlin having non-null types, it'll make sure that the string is not null when it gets passed to the function. So it adds this helper method: hey, check that this parameter is not null. Then it invokes the length method on the argument and returns that.
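Rendered back as source, the dex roughly does this-- a sketch; the real output is dex bytecode, and the helper lives in kotlin.jvm.internal.Intrinsics:

    import kotlin.jvm.internal.Intrinsics

    fun length(s: String): Int {
        // const-string "s": the string being created up front,
        // followed by the injected null check on the argument.
        Intrinsics.checkParameterIsNotNull(s, "s")
        // Then invoke length on the argument and return the result.
        return s.length
    }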
Kotlin comes with a built-in runtime library, and that's where you can find the implementations of those helper methods. And in this case, it's only a simple method that just asks: is your argument null? Yes? Then I will throw, by calling another helper. Otherwise, I just return back to the calling method.
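A simplified sketch of that helper-- the real one is in kotlin.jvm.internal.Intrinsics and differs in its details:

    fun checkParameterIsNotNull(value: Any?, name: String) {
        if (value == null) {
            // The other helper: it builds the message and always throws.
            throwParameterIsNullException(name)
        }
        // Otherwise, fall through and return to the caller.
    }

    fun throwParameterIsNullException(name: String): Nothing =
        throw IllegalArgumentException(
            "Parameter specified as non-null is null: $name")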
So method calls are pretty expensive. So the first thing that ART will do is try to inline that very small method into the caller. Here, the compiler is inlining it at the call site. Just for simplicity, this looks like dex code; it's actually the intermediate format of the compiler, but I'm not going to show that to you.
So the compiled code is being inlined, which helps on performance. But there's more we can do, because the compiler sees: oh, wait-- that throwParameterIsNullException call actually always throws.
So there's a few things I
can do with that information.
The first one is
called Code Layout,
where we're trying
to put together
the regular flow of the method.
So things that rarely happen, we put at the very end of the method, so they don't affect the flow of the execution. A nifty trick: we just switched the comparison from, hey, are you not 0, to, are you 0, and then we jump to the end of the method, which is like, hey, throw an exception. So the expensive jump is out of the regular flow now.
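At the source level, the layout trick looks roughly like this-- a sketch; the compiler actually does this on its intermediate representation, not on your source:

    fun lengthLaidOut(s: String?): Int {
        if (s == null) coldThrow()  // inverted test: a rarely-taken jump to the end
        return s.length             // regular flow: falls straight through to the return
    }

    // The throwing path, laid out after the regular flow.
    fun coldThrow(): Nothing =
        throw IllegalArgumentException("Parameter specified as non-null is null")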
The second optimization is that we're going to move things that the regular flow doesn't care about. In this case-- let me just go back-- the construction of the string that is being passed to the helper was the first thing you execute in the method. But you only need that if you end up calling the helper. So we move the construction of that string to right before the helper call, meaning the regular flow doesn't need to execute it anymore.
So in the end, we went from a method that was creating a string, calling a helper, then doing its thing-- returning the length-- to a method that just checks if the argument is null. If it is, jump to the expensive path at the end. If it's not, just continue the flow and return the length of the string.
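Put together, the final shape is roughly this-- a source-level sketch; moving the string construction like this is what compiler folks call code sinking:

    fun lengthOptimized(s: String?): Int {
        if (s == null) {
            // Cold path at the end of the method: the error-message string
            // is only constructed here, when we actually throw.
            val name = "s"
            throw IllegalArgumentException(
                "Parameter specified as non-null is null: $name")
        }
        return s.length  // hot path: one null test, then straight to the answer
    }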
All right.
So that was raw performance.
I have two other things to talk about-- actually, just one. I was going to talk about application startup and garbage collection, but I'm not going to redo the garbage collection slides.
Chad and Omar did a
great job this morning.
So with application
startup, it's
been a major focus
since we introduced
profile-guided compilation.
And that happened in Nougat.
Profile-guided compilation means that when the app is being installed, we compile it in a very quick way. We're not going to generate a full AOT compilation; we're going to do very few optimizations, ones that do not affect install time. So we're optimizing for install time.
So the app is installed. Then you run it, and the app is being executed. Initially, it gets executed with interpretation, then methods get hot, and the JIT kicks in and compiles those hot methods. The JIT knows what those hot methods are, so we dump those hot methods to a profile file. Then, when your device is idle-- the user's not using it, it's charging, 100% charged-- we have this profile-guided compilation daemon that says: OK, let me go over all the profiles and recompile the app, compiling only the things that matter based on that profile.
And you have this virtuous loop where, the next time you run the app, we're going to use that optimized version of the compiled code and run what got AOT'd. Maybe some methods got missed, so we'll interpret them. They'll get hot. We'll JIT them. We'll update the profile, and then again, the daemon kicks in and says: oh, the profile got updated, let me recompile the app. So there's this virtuous loop of trying to be better and better over time.
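As a very rough sketch, that loop looks like this-- all names and thresholds here are hypothetical; the real logic lives inside ART and its background daemon:

    const val JIT_THRESHOLD = 10_000  // assumed value, for illustration

    class MethodInfo(val name: String) {
        var hotness = 0
        var compiled = false
    }

    fun onInvoke(method: MethodInfo, profile: MutableSet<String>) {
        if (method.compiled) return     // AOT or JIT code exists: just run it
        method.hotness++                // interpreted: count how warm it gets
        if (method.hotness > JIT_THRESHOLD) {
            method.compiled = true      // the JIT kicks in for this hot method
            profile.add(method.name)    // recorded in the profile file for later
        }
    }

    // Later, when the device is idle and charging, the daemon AOT-compiles
    // exactly the methods recorded in the profile, and the loop repeats.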
And why is that helping
on application startup?
Well, that's because
the things we
do when we compile the
app based on the profile
are really optimized
towards this.
We are only going to
compile startup methods.
So now, no need
to interpret them.
Things that get executed at
startup will get compiled.
We're going to lay out the dex and the compiled code so that things that execute at startup will be next to each other. So now we don't need to jump over the entire dex file to actually get access to the methods.
Like I said, apps got bigger.
So if you need to bring up
the entire dex file just
for startup, that's a lot
of time waiting on IO.
So we're trying to reduce that by putting everything needed at startup at the beginning, and then the rest at the end.
Profile-guided compilation also generates an application image-- other runtimes would call this a snapshot. It's a file with a representation of Java classes that we put in that image. And that saves us from having to load the classes at runtime again.
So there's this pre-formatted set of classes with a class loader, and when we start up, we just take the class loader. All the classes are already populated, and we're done-- we don't need to do class loading anymore.
We're also going to try to pre-initialize classes. Java has this notion that classes need to be initialized before they can be used. So what we do during profile-guided compilation is pre-initialize anything we can, to avoid that work being executed when we start the app.
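For example, this is the kind of class-initialization work pre-initialization saves-- a hypothetical app class, not ART code:

    class DateLabels {
        companion object {
            // This list is built in the class initializer (<clinit>).
            // If ART can pre-initialize the class into the image,
            // the app never pays for this at startup.
            val MONTHS = listOf(
                "Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
        }
    }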
And then finally, I said we're not going to compile code that doesn't get executed. That helps a lot, because then your compiled file is very small, so there's not a lot you need to bring into memory to actually execute.
What do we gain from all those optimizations? Well, we always gain something, but depending on the app, it can be anywhere from 10% to 30%. And that usually tracks how much Java code you run when you start your app. Typically, Camera has a lot of native code, so that's on the low end, like a 10% improvement. But in this example, you see Docs and Maps, which are very Java heavy, get around a 30% app startup improvement. And these are numbers that we got from the Maps team, who got them from actual users. So it's actual data that comes from the field.
And when the Maps team saw that graph, they were like, what is going on? How come at install, app startup is around one second, and over time things get faster? How does that happen? And every time they update the app, it's the same trend: it starts pretty high and then goes low.
And the answer is
profile-guided compilation.
Here, you're clearly seeing that
over time, things get better.
Today in Pie-- what we talked about at I/O last year is the introduction of profiles in the cloud. That's how we're having the entire ecosystem send us profiles-- actual execution profiles from users-- so that we can send those profiles to new users of the app. They don't get this "starts at one second, ends up at 750 milliseconds" curve; they get the 750 milliseconds right away, because they get the profile at the point they install.
Garbage collection.
Like I said, I'm
not going over it.
Maybe I can just
put back a number
that we're all very proud of.
Here we are. Ah, that's the last one.
So this is summarizing what Chad talked about this morning. It's all the technology we've used over time for building a GC. You see, in KitKat, we had what we called concurrent mark sweep-- there was one part of the GC that was concurrent. And that stayed up until Nougat. In Oreo, that's when we introduced a concurrent collector.
Allocation in KitKat
was the main bottleneck.
And it was single
threaded, so it
needed to lock to actually
allocate something.
The introduction of
a new GC in Lollipop
meant that we could
allocate within the thread
and not need to lock.
So that improved
performance of allocation.
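The idea, sketched-- hypothetical code; ART's real thread-local allocation differs:

    class ThreadLocalBuffer(private var remaining: Int) {
        fun tryAllocate(size: Int): Boolean {
            if (size > remaining) return false  // slow path: refill under a lock
            remaining -= size                   // fast path: no lock needed
            return true
        }
    }

    // Each thread owns its own buffer, so allocations don't contend.
    val buffers = ThreadLocal.withInitial { ThreadLocalBuffer(64 * 1024) }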
When you allocate objects, they are short-lived, right? That's the motto of Java: feel free to allocate objects; the ones that are short-lived will be reclaimed by the GC very quickly.
But in KitKat, in Dalvik
days, that was not the case.
You paid a very high cost by
allocating temporary objects.
Lollipop is when we introduced a new GC, and you didn't pay that cost at all-- we had generations, so short-lived objects were removed pretty quickly.
There's an asterisk for Oreo, because when we introduced the concurrent collector, we removed the generations from the collector. We're fixing that today-- the generational improvement of the GC is in AOSP, so hopefully it will be on devices soon.
And then fragmentation. Fragmentation is a big problem on Android, because if you're not able to allocate memory, your app will be killed. So compacting the memory, so that things are not fragmented, is super important. KitKat did a bit, but very little. In Lollipop and Marshmallow, we were doing it when the app was going to the background, so eventually we'd reclaim the memory. But Oreo is when we decided it's really important that we compact all the time, so that the memory is there, available, all the time.
And then the number
I was looking for--
allocation speed.
We went from a very
low number in Dalvik
to an 18x improvement
in Oreo and Pie.
And here are the reasons it got improved. Lollipop added a custom allocator that did not need to lock. Then in Marshmallow, we had fewer CAS operations-- atomic operations, which have a cost, and we were able to remove some of them. Then that whole allocation path was moved to assembly code in Nougat, which made things even faster.
And then finally,
in Android Oreo,
we implemented bump
pointer allocation, which
meant the only thing
you do when you allocate
is increment a pointer.
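Conceptually, bump-pointer allocation looks like this-- a sketch, not ART's actual code:

    class BumpRegion(private val capacity: Int) {
        private var top = 0

        fun allocate(size: Int): Int? {
            if (top + size > capacity) return null  // region full: the GC takes over
            val address = top
            top += size   // the entire fast path is this one increment
            return address
        }
    }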
All right.
With that, this is the recommendation that Chad gave, and it comes from us too, so I'll give the same. Creating garbage is OK today-- you can allocate the objects you need. GC is still overhead, so be mindful that if you allocate a lot of objects, the GC will need to run. But it's much less of a problem than it was in Dalvik.
And with that, thank you.
[APPLAUSE]
[MUSIC PLAYING]
