THOMAS NATTESTAD: Hi, everyone.
My name is Thomas and together
with my colleague Ingvar here,
we're going to show you
how using WebAssembly
can speed up your
computationally intensive
workloads by more than 10x.
And how using modern
WebAssembly tooling
can let you take advantage
of WebAssembly more easily.
We'll start by reminding
everyone what WebAssembly is
and showing some
of the improvements
we've been making to
Chrome's implementation.
Then we're going to get into
some of the different language
features that are starting to
ship as part of WebAssembly.
And then finally,
we'll close out
by covering some of the
new tooling updates that
have been coming as well.
So let's start by
reminding everyone
what WebAssembly actually is.
WebAssembly is a new
language for the web that
is designed as a
compilation target
to offer maximized and
reliable performance.
It's important to
remember, though,
that WebAssembly is in no way
meant to replace JavaScript.
Rather, it's meant to augment
the things that JavaScript
was never designed to do.
So let's look at some of
the different advantages
of WebAssembly and why
you might want to use it.
First, because WebAssembly
offers strong type guarantees,
it gives you more consistent
and reliable performance
than JavaScript.
Then, with additional features
like threads and SIMD,
which we'll get more
into later, you
can also achieve speeds that
are truly higher than what
you can with JavaScript.
When thinking about comparing
baseline WebAssembly
performance to JavaScript,
I find this metaphor
which my colleague [? Sama ?]
came up with really useful.
JavaScript is like
running along a tightrope.
It's possible to go fast, but
it requires a lot of skill
and it's possible to
fall off the fast path.
Whereas baseline WebAssembly is
more like running along a train
track.
You don't have to be as
careful in order to go fast.
Another advantage of WebAssembly
is its amazing portability.
Because you can compile
from other languages,
you can bring not only your
own code bases and libraries
to the web, but also the
incredible wealth of open
source libraries built
in languages like C++.
Lastly, and potentially
most exciting to many of you
out there, is the possibility
of more flexibility
when writing for the web.
Specifically, the ability
to write in other languages.
Since the web's
inception, JavaScript
has been the only
fully supported option.
And now through WebAssembly,
you get more choice.
Most exciting
though, is the fact
that WebAssembly is now
shipping in all major browsers,
making it the first new language
to ship in every major browser
since JavaScript was created
more than 20 years ago.
So now that we all are reminded
of what WebAssembly actually
is, I want to cover
some of the improvements
that we've been making
directly in Chrome.
One of the biggest requests that
we've heard from our developers
is the desire for
faster startup time.
To improve startup time
for WebAssembly modules,
we're starting to
roll out something
we're calling implicit caching.
To recap, when a site
loads a WebAssembly module,
it first goes into
the Liftoff compiler
to start executing immediately.
It is then further optimized
off the main thread
by the TurboFan
optimizing compiler,
and then the result is
hot swapped in when ready.
Now, with implicit
caching, we also
cache that optimized
WebAssembly module directly
in the HTTP cache.
Then, after the user leaves
the page and comes back,
we load that optimized
module directly
from the cache, resulting in
immediate top tier performance.
As the name suggests, implicit
caching happens automatically.
But there are two tips worth
knowing and keeping in mind.
The first is that code
caching in WebAssembly
works off of the streaming APIs.
So make sure to always
use compileStreaming
or instantiateStreaming.
The second thing is
just to make sure
that you're being
cache friendly.
The cache is keyed
on the URL
of the WebAssembly module.
So if this changes
on each load, you
won't see any of the benefits.
In addition to new features
like implicit caching,
we're also always
making improvements
to our WebAssembly engine.
Here you can see how
commit by commit,
we've cut startup
time by almost half
since just the start
of this last year.
OK, so now that we've covered
some of the improvements that
have been made in
Chrome, I want to get
into some of the
actual new language
features of WebAssembly.
The first feature that
I want to talk about
is WebAssembly threads.
Threads are a key part
of practically all CPUs,
and utilizing them
fully and effectively
has been one of the great
challenges for the web
until now.
WebAssembly threads
work by relying
on three specific things--
Web Workers, SharedArrayBuffer,
and atomic operations.
Web Workers allow WebAssembly
to run on different CPU cores.
Then SharedArrayBuffer
allows WebAssembly to operate
on the same piece of memory.
Lastly, atomic
operations, specifically
atomic.wait and
atomic.notify, let
you synchronize your
WebAssembly so that things
happen in the right order.
Google Earth adopted WebAssembly
threads with great success.
They saw their frame
rate almost double
and their number of dropped
frames cut by more than half.
Soundation, a music
editing studio,
similarly adopted
threads to enable
highly efficient parallelization.
As they increased their
number of threads,
they saw their performance
more than triple.
One application that I'm
particularly excited to share,
which is coming to the web through
WebAssembly threads, is VLC.
They were able to originally
compile their code base
to baseline WebAssembly.
But without threads,
they weren't
able to achieve anything
close to the performance
that they needed.
Now, thanks to threads, they
have a working prototype
running directly in Chrome.
So going back to our
analogy from earlier,
if baseline WebAssembly is
like running along a train
track, WebAssembly with threads
is like an actual train.
You're achieving speeds that
were previously impossible.
Threads have been
available in Chrome desktop
since version 74.
In Chrome for Android and in Firefox,
threads are implemented
but not enabled
by default. We're
actively working with
other browser vendors
and the WebAssembly
community to make threads
available in more places.
Since threads are
not supported everywhere,
it's critical to use
feature detection
before relying on their
presence, which Ingvar will now
show you how to do.
INGVAR STEPANYAN:
Thank you, Thomas.
Unfortunately,
WebAssembly does not
have built-in feature
detection yet, although it's
being actively worked on.
For now, we created a
JavaScript library instead
that you can use to detect
WebAssembly features supported
by your browser.
This allows you to build several
versions of your WebAssembly
module, for different
feature sets,
just like you would
for modern JavaScript
bundles and dynamically choose
the ones that your browser can
handle.
For example, you can use the threads
function in order to detect
whether the browser
supports threads
in WebAssembly.
Then you can use
dynamic import to load
either the version of your
WebAssembly module
and JavaScript
bindings that makes
use of threads for optimizations,
or the regular one for older
browsers.
How do you build a version for
threads, in the first place?
If you're using Emscripten,
you need to pass the argument
-pthread during
compilation, like you would
to regular, native C compilers.
And it will automatically
generate the WebAssembly module
and the JavaScript necessary
for creating, managing,
and communicating with the
Web Workers under the hood.
If you write in C,
Emscripten allows
you to use common
POSIX thread APIs,
just like those available
on native Unix platforms.
For example, you can
use pthread_create
with the handler
function and arguments,
in order to start a new thread
and then call
pthread_join in order to wait
for it to finish and read
the results back.
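
The slide code for this step is not part of the transcript. A minimal sketch of the pattern described, assuming a hypothetical `SumTask` payload and worker function (neither comes from the talk), could look like this. The calls themselves are the standard POSIX ones, and with Emscripten the file would be built with the `-pthread` argument mentioned above.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical payload: the input and a slot for the result.
struct SumTask {
  int limit;      // input: sum the integers 1..limit
  long long sum;  // output: written by the worker thread
};

// Thread entry point with the signature pthread_create expects.
static void* sum_worker(void* arg) {
  auto* task = static_cast<SumTask*>(arg);
  task->sum = 0;
  for (int i = 1; i <= task->limit; ++i) task->sum += i;
  return nullptr;
}

// Starts one thread, waits for it to finish, and reads the result back.
long long sum_on_thread(int limit) {
  SumTask task{limit, 0};
  pthread_t thread;
  int rc = pthread_create(&thread, nullptr, sum_worker, &task);
  assert(rc == 0 && "pthread_create failed");
  rc = pthread_join(thread, nullptr);
  assert(rc == 0 && "pthread_join failed");
  (void)rc;  // silences unused-variable warnings when asserts are disabled
  return task.sum;
}
```

Calling `sum_on_thread(100)` runs the loop on a second thread and returns once `pthread_join` has confirmed that the worker is done.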
If you write in C++,
the good news is that
Emscripten's implementation
of the standard
thread APIs, just like on Unix,
makes use of POSIX threads
under the hood.
And other high-level
APIs, such as std::async,
make use of std::thread
at the C++ standard library level.
So they all just work.
This means that, for example,
you can use std::thread with
closures in the C++ code.
And it will lower
to the same pthread calls
and be handled by Emscripten.
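
As an illustration of std::thread with closures, here is a small sketch that splits a sum across several threads. The function name and the chunking scheme are assumptions made for this example, not code from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` by splitting it into `num_threads` contiguous chunks.
long long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
  assert(num_threads > 0);
  std::vector<long long> partial(num_threads, 0);
  std::vector<std::thread> workers;
  const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

  for (std::size_t t = 0; t < num_threads; ++t) {
    // The closure captures the shared vectors by reference and its index by value.
    workers.emplace_back([&data, &partial, chunk, t] {
      const std::size_t begin = std::min(t * chunk, data.size());
      const std::size_t end = std::min(begin + chunk, data.size());
      // Each thread writes only to its own slot, so no locking is needed.
      partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
    });
  }
  for (auto& worker : workers) worker.join();  // wait for every chunk
  return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Giving each thread its own output slot keeps the sketch free of data races without any atomics or mutexes.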
Similarly, you can use
std::async APIs to spawn
futures, which are quite
similar to JavaScript promises,
but allow you to spawn
tasks on your threads.
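
A minimal sketch of the std::async pattern follows. The `fib` helper is a hypothetical stand-in for an expensive task and is not taken from the talk.

```cpp
#include <cassert>
#include <future>

// Hypothetical expensive computation, used only for illustration.
static long long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

// Computes fib(a) and fib(b) as two asynchronous tasks and adds the results.
long long fib_pair_sum(int a, int b) {
  assert(a >= 0 && b >= 0);
  // std::launch::async requests that each task runs on its own thread.
  std::future<long long> fa = std::async(std::launch::async, fib, a);
  std::future<long long> fb = std::async(std::launch::async, fib, b);
  // get() blocks until the value is ready, a little like awaiting a promise.
  return fa.get() + fb.get();
}
```

Unlike a JavaScript promise, `get()` blocks the calling thread until the task has produced its value.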
In Rust, the story is
not quite as
fleshed out yet,
as you need to manually
create Web Workers,
send them the WebAssembly module
and memories that you
want to share, as well as
rebuild the standard library
with thread support.
However, after jumping
through a few hoops,
you are able to even use popular
multi-threading libraries,
like Rayon, as in this
demo by the Rust WebAssembly team.
Here, they took
a ray tracer, split
rendering across several threads,
and compiled it to WebAssembly.
You can see how,
with a single thread,
it takes 1.7 seconds to
render the entire image.
But if you split the work
into, say, four threads,
it takes only 0.8
seconds, making it
more than two times faster.
Another performance feature
that is making its way
into WebAssembly is SIMD.
And I'd like to invite Thomas
back, to tell us what it is
and how it can help us.
THOMAS NATTESTAD: Thank
you so much, Ingvar. So,
SIMD stands for Single
Instruction Multiple Data.
And while this may not be a term
that most web developers are
familiar with, it's
an absolutely key part
of modern CPU architectures.
So to explain SIMD, let's
take this simple example
of adding two arrays
together into a third array,
using a simple for loop.
Without SIMD, the CPU
goes through this loop
and adds the different
elements together, one by one,
taking four full steps.
Now, with SIMD, the CPU is able
to vectorize these elements
and then take just a single CPU
operation to add them together.
This may seem simple, but
it can have dramatic impacts
on performance.
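
The slide with this loop is not in the transcript. The sketch below shows an assumed equivalent: a plain element-by-element loop, and a second version whose body is grouped four elements at a time to mirror what a 128-bit SIMD add does with 32-bit lanes. The function names are made up for this example, and the second version is ordinary scalar C++, not actual vector code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adds a[i] + b[i] into out[i], one element per loop iteration.
void add_arrays(const std::int32_t* a, const std::int32_t* b,
                std::int32_t* out, std::size_t n) {
  assert(a && b && out);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

// Same result, grouped the way a 128-bit SIMD unit sees the data:
// each outer step covers four 32-bit lanes.
void add_arrays_in_groups_of_four(const std::int32_t* a, const std::int32_t* b,
                                  std::int32_t* out, std::size_t n) {
  assert(a && b && out);
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    // With SIMD enabled, these four additions can become one vector add.
    out[i + 0] = a[i + 0] + b[i + 0];
    out[i + 1] = a[i + 1] + b[i + 1];
    out[i + 2] = a[i + 2] + b[i + 2];
    out[i + 3] = a[i + 3] + b[i + 3];
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover elements
}
```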
To show the power
that SIMD can deliver,
I want to show off some of the
work done by our colleagues
at Google Research.
They've developed several
real-time ML models
that can do everything from
letting you try on fake glasses
or puppet masks, doing dynamic
background removal, and much
more.
One of the coolest demos is
this hand tracking system.
And here, you can really see
the difference that SIMD makes.
Without SIMD, you're only
getting about three frames
per second, while
with SIMD, you've
got a much smoother
15 frames per second,
which makes all the difference.
You can visit this link
to check these out for
yourself or come by the
sandbox to play with them.
The Google Research team looked
at a bunch of their models
and found that, in general,
SIMD offered a 3x improvement
on overall speed.
The next example that
I want to show off
is OpenCV and some of the
work done by our friends
at Intel and UC Irvine.
OpenCV is an extremely
popular image analysis library
that has tons of
performance-dependent functionality.
OpenCV can be compiled
to WebAssembly
and run directly in the browser.
It can be used for doing
things, like card reading,
replacing real
emotions with emojis,
and for all the Harry
Potter fans out there,
you can now have your very own
web-powered invisibility cloak.
You can visit this
link to try them out.
Or again, come by the sandbox
to check and see them there.
This work has actually been
fully upstreamed into OpenCV.
And they even have
a tutorial on how
to set up OpenCV
with Emscripten,
so that you can all play
with this yourself, at home.
And all of this functionality
can take advantage of threads
and SIMD to dramatically
improve performance.
Here we can see the
visual difference
of first adding SIMD and
then SIMD plus threads.
And our benchmarking backs
up this visually noticeable
difference.
When using both threads
and SIMD together,
common tasks in OpenCV can
be improved by around 15x.
And some of the benchmarks show
even more dramatic improvements
from threads and SIMD.
For the OpenCV kernel
performance test,
using threads gives
you a 3.5x improvement.
And using SIMD gives you an even
more impressive 9x improvement,
just by itself.
And then when you
take these together,
it results in an
overall 30x improvement
to this performance test,
which is truly staggering.
So coming back to
our train analogy,
because who doesn't love trains,
if WebAssembly threads is
like an old-style train, using
threads and SIMD together
is like a modern bullet train.
So to show you how to actually
take advantage of this in code,
I'd like to hand
it back to Ingvar.
INGVAR STEPANYAN:
Thanks, Thomas.
To build code with
SIMD in Emscripten,
you need to pass a special
parameter -m, which
tells the underlying
Clang compiler
to enable a specific
feature, followed by simd128,
which is the feature name
for the currently supported
128-bit SIMD operations
in WebAssembly.
In Rust, you need to pass the
same feature name via the -C
target-feature compiler flag.
The easiest way to do
this on a real project
using Cargo or wasm-pack
is currently
via the environment
variable RUSTFLAGS,
passed during compilation.
Now that we've covered
how to compile our code,
let's see what it takes to
actually use SIMD in our code.
The good news is that,
in the simplest case,
the answer is nothing.
That is, unlike with threads,
SIMD is something the compiler
can often take advantage
of, and take care of,
without you having to
modify any code at all.
This compiler feature is
called auto-vectorization.
And it detects
loops that perform
the same mathematical
operations on array items,
independently.
For example, let's take a
look at this simple code in C.
Or the same one in C++.
Or the same one in Rust.
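
The code from the slides is not included in the transcript. Going by the description that follows (a loop that loads an item, multiplies it by 10, and stores it back), a plausible C++ version could look like the sketch below. The function name, file name, and exact build commands in the comment are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Multiplies every element by 10 in place.
//
// Assumed build commands (the file name is hypothetical):
//   em++ -O3 scale.cpp            scalar loop only
//   em++ -O3 -msimd128 scale.cpp  lets the compiler auto-vectorize the loop
void scale_by_ten(std::int32_t* values, std::size_t count) {
  assert(values != nullptr || count == 0);
  for (std::size_t i = 0; i < count; ++i) {
    values[i] *= 10;  // each element is independent of the others
  }
}
```

Nothing in the source changes between the two builds; only the compiler flag differs.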
Such a loop operates
on an array of numbers.
Check.
It performs
arithmetic operations.
Also, check.
And it clearly operates on
each item independently.
Also, check.
So the compiler should be
able to make use of SIMD
to process several elements at
once, rather than handling them
one by one,
and make it faster.
Let's see if it does.
First, let's compile this code,
in any of the source languages,
without SIMD enabled
and take a look
at the generated WebAssembly.
We can see that our function
gets compiled to a loop
that loads an item from an
array, multiplies it by 10,
and stores the result back.
No surprises here.
Now, let's compile it again
with SIMD enabled.
What we can see is that, aside
from the regular boilerplate,
there is now another
loop that loads
four items out of an
array, multiplies them
by four instances of the number
10, and stores the results
back, also in just
one operation.
While this is
a toy example, and not
a real-world benchmark,
it's interesting to see
how such implicit
optimization can
help to achieve a
consistent three times
increase in performance
of the generated code.
In some situations,
however, you don't
want to leave it to chance
to have your code optimized
this way or your data
has a specific layout
or you just want more control
over which features are used.
This is where intrinsics
can be helpful.
Intrinsics are
special helpers that
look like regular
functions but correspond
to specific instructions
on the target.
For SIMD in Emscripten,
they live
in the wasm_simd128.h
header and contain
all the basic operations
for creating, loading,
storing, and operating
on the supported
SIMD vector types.
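
As a rough sketch of what using these intrinsics can look like, the function below multiplies an array by 10 four lanes at a time. The function name is made up for this example. The intrinsic path is compiled only when the compiler defines `__wasm_simd128__` (which Clang does when `-msimd128` is passed); otherwise a plain scalar loop does all the work, so the same file also builds natively.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

// Multiplies every element by 10 in place, four lanes at a time when possible.
void scale_by_ten_simd(std::int32_t* values, std::size_t count) {
  assert(values != nullptr || count == 0);
  std::size_t i = 0;
#if defined(__wasm_simd128__)
  const v128_t ten = wasm_i32x4_splat(10);      // four lanes, each holding 10
  for (; i + 4 <= count; i += 4) {
    v128_t lanes = wasm_v128_load(values + i);  // load four int32 values
    lanes = wasm_i32x4_mul(lanes, ten);         // multiply all four lanes at once
    wasm_v128_store(values + i, lanes);         // store four results back
  }
#endif
  // Scalar tail; this is also the whole loop when SIMD is not enabled.
  for (; i < count; ++i) values[i] *= 10;
}
```

Keeping a scalar fallback in the same function fits the advice to ship a separate non-SIMD build for browsers that do not support it.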
In Rust, the easiest way to use
them is via the external
packed_simd crate,
which is intended to be
a prototype for a future
standard library
API.
One important thing to keep
in mind is that SIMD is still
experimental and available
only in Chrome under
a flag. So just
like with threads,
you need to make a separate
build that makes use of SIMD.
And then use a feature
detection library to load it,
only if it's supported.
Now that we've covered
new WebAssembly features,
we've got some exciting
tooling improvements
to share with you, too.
First of all, earlier
this year, LLVM,
the compiler infrastructure
behind projects
such as Clang and Rust
and lots of others,
has stabilized and finished
support for the WebAssembly target.
This includes both compilation
of separate source files
into WebAssembly object
files, as well as linking them
together into the final module.
It's not very usable on its own.
For example, while it allows
you to compile separate C/C++
files into WebAssembly, it
doesn't include any standard
library.
And it expects you
to bring your own.
However, it does provide a solid
foundation for other compilers
to build on.
Let's take a look at Emscripten.
Before this, Emscripten had
to maintain a complex, custom
compilation pipeline and a
fork of LLVM, called fastcomp,
in order to parse an
intermediate representation
from Clang, compile
it to asm.js,
and when WebAssembly came along,
also convert it to WebAssembly.
Having to work around
LLVM this way led
to various incompatibilities,
as well as
difficulties during
upgrades and suboptimal
compilation performance.
Now, since
WebAssembly support
has been properly
integrated into LLVM,
Emscripten can leverage it
to simplify the compilation
process and focus on
providing a great development
experience, custom features,
and a standard library,
while all core work on
features and optimizations
can continue to
be developed upstream.
As an example of
the improvements, switching
to the native backend allowed
Emscripten to significantly
improve linking times,
with a small extra cost
to initial compilation.
This particularly helps on
incremental development,
where you usually modify
and recompile only
one or two files at a time.
And all you need is
a fast linking step.
Some projects have seen as
much as seven times improvement
in recompilation
times, in such cases.
However, there were some
compile-time features,
unique to Emscripten,
that were previously
handled by the earlier
mentioned fork of LLVM,
and could be lost in transition.
One such feature
is Asyncify.
Normally, when calling from
JavaScript to WebAssembly,
and then from WebAssembly
to some Web APIs,
you expect to read the result
back, continue execution,
and eventually
return to JavaScript.
However, many
long-running and expensive
Web APIs tend to spawn
asynchronous tasks,
to avoid blocking the main
thread. This includes
timers, the Fetch
API, the Web Crypto API,
and lots of others.
Because WebAssembly does not
have a notion of an event loop,
promises, or asynchronous
tasks, to WebAssembly it
would look like
the external API returned
as soon as it
started execution.
So it would continue running
the user's code immediately,
while the async task is still
running in the background,
with no handlers attached.
This is not what
we normally want.
We want to not only be able
to start an asynchronous task,
but also wait for it to
finish, read the results back,
and continue
execution afterwards.
This is where
Asyncify comes in.
I won't go too much into
implementation details here.
But what it does is compiles
the WebAssembly module in such
a way that you can
suspend execution,
remember the state,
and later, resume
from the exact same point,
when an asynchronous task
has finished its execution.
This is quite similar
to await, in JavaScript,
but applied to native
functions and with no changes
to your own code.
In order to use it
from Emscripten,
you need to pass a
special parameter,
-s ASYNCIFY, and specify
which imports should
be treated as asynchronous.
Then, in your code,
you can use
regular function imports
and invoke them like
any other functions,
while Asyncify does
its magic under the hood.
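
The talk's own example of an asynchronous import is not in the transcript. As a stand-in, the sketch below uses Emscripten's `emscripten_sleep`, which relies on Asyncify to pause the WebAssembly code, hand control back to the browser event loop, and resume at the same point later. The `run_steps` function is hypothetical, and the native branch exists only so that the same file can be built and run outside the browser.

```cpp
#include <cassert>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
#include <chrono>
#include <thread>
#endif

// Pauses for `ms` milliseconds.
static void pause_ms(unsigned ms) {
#ifdef __EMSCRIPTEN__
  // Needs a build with -s ASYNCIFY: the call unwinds the stack, yields to
  // the event loop, and execution resumes right here afterwards.
  emscripten_sleep(ms);
#else
  std::this_thread::sleep_for(std::chrono::milliseconds(ms));
#endif
}

// Looks like ordinary synchronous code, even though every pause is
// asynchronous from the browser's point of view.
int run_steps(int steps, unsigned pause) {
  assert(steps >= 0);
  int completed = 0;
  for (int i = 0; i < steps; ++i) {
    pause_ms(pause);
    ++completed;
  }
  return completed;
}
```

The loop keeps its local state (`i`, `completed`) across each pause, which is the part Asyncify saves and restores.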
The great news is that, with
the transition to the upstream
LLVM backend,
this feature has not gone away
but was extracted as a separate
transform and can now be used
from any language,
not just C/C++,
as long as it
compiles to WebAssembly.
For example, you
can simply invoke
asynchronous
JavaScript functions
from Rust, which is
particularly helpful for
[? both ?] [INAUDIBLE] standard
synchronous system APIs,
available on other platforms.
Since you are not
using Emscripten
in this case, after
you have compiled
your module, you need to process it
using the wasm-opt tool instead,
and it will add all the
necessary magic for suspending
and resuming execution.
Then, you'd need some glue on
the JavaScript side as well.
We have a library you can use.
It mimics the regular
WebAssembly API,
but it allows you to
instantiate modules
with asynchronous
imports and exports.
To use it, first import it
from the asyncify-wasm
module.
And then, you can use the
regular instantiation APIs,
but with asynchronous imports
and exports in addition to
the regular ones.
Since your
WebAssembly module
might now invoke asynchronous
APIs at arbitrary points,
all the exports need to
become asynchronous, too.
So you need to prefix
calls to your exports
with await.
And you're good to go.
One particularly interesting
use case for Asyncify,
aside from external
APIs, is in Emscripten.
Emscripten allows you to mark
parts of your code that are
rarely used, splits them into
a separate WebAssembly module
during compilation,
and lazy-loads them
only when they're invoked.
This allows you to keep
your initial bundle small,
without any breakage to your own
code and with minimal changes.
To use it, you need to
call a special function,
emscripten_lazy_load_code.
During compilation, it will
extract any following code
into a separate
WebAssembly module.
Then, at runtime, when,
or if, that code is actually
reached during
execution, Emscripten
will use Asyncify to dynamically
load the missing pieces
and continue as if
there was never a split
in the first place.
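
A hedged sketch of how such a split point might be marked is shown below. `emscripten_lazy_load_code` is the function named in the talk; the surrounding functions are hypothetical, and the build settings in the comment are an assumption about the Emscripten version of that time, so check the documentation of your version before relying on them.

```cpp
#include <cassert>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#endif

// Hypothetical rarely used feature that is worth keeping out of the
// initial download.
static int rarely_used_report(int value) { return value * value + 1; }

// Assumed build settings: -s ASYNCIFY -s ASYNCIFY_LAZY_LOAD_CODE
int handle_request(int value, bool want_report) {
  assert(value >= 0);
  if (!want_report) return value;  // common path stays in the initial module
#ifdef __EMSCRIPTEN__
  // Code that can only run after this call may be moved to a second
  // module, which is fetched the first time execution gets here.
  emscripten_lazy_load_code();
#endif
  return rarely_used_report(value);
}
```

The calling code does not change: `handle_request` still looks like one synchronous function.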
These are all great features.
And it's amazing to see
how WebAssembly is
growing over time.
However, with this
feature growth,
the surface area for potential
bugs has expanded as well.
When things go wrong,
and we all know,
they often do, you want
to be able to track where
the problem occurred,
reproduce it step by step,
track the inputs that led to
the issue in the first place,
and so on.
You want to be able to
debug your application.
Until recently,
you had two options
for debugging WebAssembly.
First, you could get
stack traces,
as well as step over
individual instructions
in the WebAssembly text format.
This helps somewhat with
debugging of small isolated
functions.
But it's not very practical
for larger apps, where
the mapping between
the disassembled source
and your original
sources is less obvious.
To work around this
problem, Emscripten and DevTools
initially adapted the
existing source maps format,
which was designed for
languages that compile
to JavaScript, for WebAssembly.
This allowed mapping
binary offsets
in the compiled module to
locations in the original source
files.
However, this format was
designed for text languages
with a clear mapping to
JavaScript concepts
and values, and not
for binary formats
like WebAssembly, with linear memory,
arbitrary source languages,
and arbitrary type systems.
This makes the
integration hacky,
limited, and not
widely supported
outside of Emscripten.
On the other hand, many
native languages already
have a common
debugging format that
contains all the necessary
information for the debugger
to resolve locations, variable
names, type layouts, and much
more.
This format is called DWARF.
While there are still some
WebAssembly-specific features
that need to be added
for full compatibility,
compilers like Clang
and Rust already
support emitting
DWARF information
in WebAssembly modules,
which allows us to start
using it directly in DevTools.
As a first step, we went ahead
and implemented native source
mapping.
So you can start debugging the
WebAssembly modules produced
by any of these
compilers, without having
to resort to the disassembled
format or [INAUDIBLE] scripts
for source map generation.
This integration currently covers
stepping in and over the code
in any of these languages,
setting breakpoints,
and resolving stack traces.
There's still much
more we can do
though, such as
pretty-printing types
or even evaluating expressions
in the source languages.
We are actively
working on bringing
this and many other improvements
to the WebAssembly experience.
So please stay tuned
for future updates.
And thank you, for
your time, today.
[APPLAUSE]
[MUSIC PLAYING]
