MODERATOR: Welcome everyone,
today's presentation is about
C++11 atomics by Robin.
Robin interned with us in
the Chrome pinnacle team.
He's a PhD student at ENS Paris.
And so his internship
is finishing this week.
He'll be giving a same or
similar talk at the LLVM
Developer Conference
later this week.
So please keep the questions
towards the end, and with that,
Robyn.
ROBIN MORISSET: Thank you.
So I will start by
quickly explaining
what these atomics are, and
what they are useful for,
and then focus on how
we can optimize them
in the LLVM Compiler.
So the problem is
how can you write
a portable and [INAUDIBLE] card?
Because for example,
with this card,
it may seem clear that
you will never print 00.
Because if you look at all
of the possible [INAUDIBLE]
actions, you we see
that you can never
have both prints come before
their respective store.
But in practice, you can
see 0-0 if you actually
execute that card.
Because as most processors,
including the x86 family,
can reorder these [INAUDIBLE].
More precisely,
the kind of sensors
store in a store buffer,
and then execute the load
before the store
has been propagated
to the other the core
or to anywhere else.
So you need some extra
instruction [INAUDIBLE] fence
to a [INAUDIBLE] processor.
Don't do that. [INAUDIBLE]
buffer and do things in order.
So now, what about this card?
It may seem like it
will never print a zero.
If it prints anything,
it should print 42.
Indeed in x86, you won't
have any issue with this.
Because the only tricky
thing is a stop buffer.
And a stop buffer is
first in, first out.
So by the time the
store of one is ready,
is sent to the other core,
the store of 42 in 2x
was already sent.
So no issue.
But on ARM, for example, the
stores in the store buffer
can be sent in a
different order.
So you can see ready equal
1, before you see x equal 42.
So on ARM, you need--
actually it's [INAUDIBLE].
And in practice,
it is even worse,
because ARM processors can
also speculate the read of x.
And say, well, I will first read
x, then I will read the ready,
and do the branch and
print whatever value
I write before
[INAUDIBLE] ready.
So you need another fence
on the second thread
to prevent that kind of thing.
So all this is [INAUDIBLE]
tricky, and even worse,
it cannot be cached by a test.
For example, if you are testing
on x86 and then someday later,
someone will run a
[INAUDIBLE] on ARM,
you may have bugs
waiting for you.
So in the [INAUDIBLE],
they added
ways of portably
specifying where
you want the compiler to
introduce such fences.
So in a nutshell, what does
the simplest [INAUDIBLE]
says is that if you
have a race, meaning
if in any execution
of your program,
you have two accesses
to the same location,
with no synchronization in
between, and at least one
of them is stored, your
[INAUDIBLE] program
is completely undefined.
Your compiler can do
whatever it wants.
If you have logs,
used everywhere
to enforce a synchronization,
then you just
have the old intuitive behavior.
You only have to
look at interleaving
of your actions,
no tricky parts.
Finally, because
it is a C, you may
want to write some
lock-free code,
and still have well-defined
programs, so you
can tag every load of store
with a special atomic tag,
telling the compiler, I know
that this access may be risky,
don't consider it undefined.
Instead give it this
or that semantics.
So the first of such
tag is a [INAUDIBLE],
which is the stronger one.
So it is the most
expensive, pretty much
always require at
least one time.
And so this is a first
example that we show,
that was failing on x86, 1/6.
So [INAUDIBLE] safety says
that if all of your accesses
have this attribute, then
you can get the [INAUDIBLE]
just by looking at the possible
interleavings, as the accesses.
However, we saw that-- we
saw that for this, we only
need a fence on x86, not on ARM.
So it would be wasteful to
use the expensive [INAUDIBLE]
attribute, because
then the compiler
would act 10
fences, even on ARM.
So you have a
cheaper one, actually
a pair of a cheaper one
called release acquire.
The meaning is that whenever
you load with a acquire
tag, something which was
stored with a release tag,
everyone after you
will see everything
that was done
before that release.
So it creates them
as synchronization,
and so here, for example,
the load of ready--
because it is acquired, if
it loads from the store that
is released, then it will
also see as a new value of x.
So one obvious question
is, what kind of impact
does this have on older
compiler optimizations?
So one simple example
is [INAUDIBLE] motion.
So here we have code where in
every iteration of the loop,
we are producing objecting
x, and the compiler would
like to just do one store
to x at the end of the loop.
So is this correct?
Well, it is tricky, because
n can be equal to 00.
And then you have no
iteration of the loop.
And you are just setting x
to some temporary variable,
and then restoring it.
And this is unsafe,
because you may
have another thread,
some lines modifying x.
And x has no atomic attribute.
So it-- yes?
It is just a point--
AUDIENCE: So it's
++ [INAUDIBLE].
ROBIN MORISSET: Oh, yes.
Sorry.
So, yes, we may have another
thread with modified memory
pointed to two by x.
And so it is undefined.
And actually in this is
case, it is [INAUDIBLE],
because we may override
the new value of x.
So a simple rule is to never
add a store on a password, that
could've been no store
to the zeta valuable.
Because you are
introducing a race,
possibly because you cannot
know what another thread may be
doing at the same time.
Now what about
eliminating things?
It seems much safer
than introducing things.
So for example, here,
you condition an
eliminating the
store 42 into x, is
that you have no access
in between the two
stores that [INAUDIBLE] with x.
But for example, if you
have this code in between,
with another flag one or
flag two, analysing x--
it is still unsafe to
eliminate x, x equal 42.
Because you're can
have another thread,
which is reading x,
and then printing it.
And you may form that other
thread to see that x is 42.
So it is not to safe,
as it is in this case
to do the dead
store elimination.
So an interesting point that if
you only have an equation, then
it is a safe to
eliminate x equal 42,
because any possible read
of x in another thread
would be racing
with that stop of x.
So the compiler can
be thinking, well,
I assume that the
programmer is careful,
and is not to using any
identifying behavior.
So I know that no other thread
is modifying x, or reading it.
Similarly if we only have
a release and acquire,
we are safe.
So the only issue,
the only hazard
is an issue of a release
followed by an acquire.
Because then you can give
control to another thread,
and then wait and take back that
control all without any kind of
[INAUDIBLE].
So we saw that optimizing--
[INAUDIBLE] on when
[INAUDIBLE].
So another idea is, can
we optimize the atomics
themselves?
So in one example, if you
just compile every access one
by one, you will need an ARM to
add a fence after every acquire
load, and a fence
before every restore.
And this seems
wasteful, because you
may get two fences in
a row, which is not
stronger than just
having one fence.
So you are going to
have one unused fence.
A slightly tricky
case is if you have
some control flowing in between.
Because here you can not
remove any of the two fences
without breaking some
ordering with the other store.
But you still have two fences
in a roll on the main path.
And so a more optimal
placement for the fences
would be either this or
this-- which in both cases,
guarantees exactly
the same ordering.
But you only have one
fence on the main path.
So I looked at how
to do that, and it
is very similar to the issue
of a partial redundancy
elimination, in that
you want it to move
some fence to another path,
to be able to eliminate it
on the main path.
So I implemented
an algorithm from
the partial redundancy
elimination literature,
which is built on
top of a min-cut.
So first you build your
graph with one node
before and after every access.
And one node activity point for
where the control flow splits,
and one node at every point
where the control flow joins.
Then for each fence, you
put every access point
is a source set, and
every access after it
is this sync set.
You do the same for
the other fence.
Then you put for
each edge, how much
it would cost to execute
a fence on that edge.
So [INAUDIBLE] here, we put
[INAUDIBLE] on the cross edge--
on the across block
edges, because we
don't want the competition
of adding fences there.
But we can use the
frequency of the execution
to see that these edges are more
costly to other fences to this.
And then you do a min-cut, and
put a fence on the min-cut.
And so the idea is that
you don't literally
have at least one fence on
every path from an access that
was before a fence, to an access
that was after that fence.
So it is guaranteeing
exactly the
ordering conditions
that you want,
but not minimizing the
dynamic cost of the fences.
And so we get exactly
what we were hoping for.
So it may seem a bit too
much work for just this,
but there are many other actions
that it can be extended for.
For example,
commencing spin loops
is to wait on an acquire load.
So if you are doing a load
acquire again and again,
and in every iteration,
you enter the cost
of the fence-- which is creating
some kind of a content tension.
So you would like to think of
the fence outside of the loop.
This is a finer [INAUDIBLE],
because the ordering
between the different
loads is already
guaranteed by the processor.
So they are loads of
the same [INAUDIBLE].
So you can't have
any kind of an issue.
You only want to
synchronize with whatever
will happen after the loop.
And so if we look
at this from the--
AUDIENCE: What does
ldr r0 [INAUDIBLE]?
ROBIN MORISSET: Oh, this is just
loading into r0 the value which
is pointed at the
current point on 0.
AUDIENCE: But that's
not [INAUDIBLE].
AUDIENCE: [INAUDIBLE]
ROBIN MORISSET: Oh, yeah, right.
Yes.
Right, it should have been r1.
Sorry, I should have-- I tried
to minimize the assembly,
and I minimized it too much.
Thank you.
So if we look at the
control flow graph,
the issue here is that
this axis at the top
is also in the sink set,
because of the loop.
But we can't just
teach the algorithm
that whenever it is tracking
a fence, which was introduced
by an acquire, it will keep
that access when building a sink
set.
And then this will
automatically sink the fence,
because as soon
as it can be sank,
it will be much cheaper
to do outside of the loop
than inside of the loop.
AUDIENCE: [INAUDIBLE]
some weight [INAUDIBLE]
replacing a partially
local weight with something
that all the way [INAUDIBLE] I'm
not sure what you actually want
[INAUDIBLE]
ROBIN MORISSET: I'm
not entirely sure,
but I think that the fence
can have some global impact.
Because you must communicate
with the other cause.
So I think you just want to
reduce the content tension,
it would be a matter of
adding some [INAUDIBLE]
or some [INAUDIBLE],
instead of a fence.
AUDIENCE: [INAUDIBLE]
ROBIN MORISSET: OK.
It will just-- one
example of the way
that it can be extended
in various ways
for different patterns.
So then I also looked at
what LLVM was, actually
doing a simple example
[INAUDIBLE] that there are lots
of smaller things on
the patterns to fix.
So one example from
a pair of [INAUDIBLE]
is that sometimes when
you have a set clock,
you would like to end a
critical section with a load.
And so to have
the right ordering
and to make sure that
nothing can be sank out
of a critical section, would
like that load to be released.
There is no such thing as
a release load in C 11.
So one way of trying to make
it, would be with a fetch
and add of 0,
which is a release.
Because then you are
saying, I'm doing a store,
but that does nothing,
which is release.
So that it acts as
a one way fence,
and keeps everything
before it before it.
And I am also getting as a
side effect, the value of x.
But if were to do that
on a x86, for example,
you will get instructions
for-- [INAUDIBLE]
add that will touch the
cache timing writing,
that we'll try to write to it.
Even if it is exactly
the same value.
And so it will add
some content tension,
because you will ask
for an exclusive access
to that cache line.
And so when a
[INAUDIBLE] would be
to see that, while I am
never modifying thing x,
I might as well just load it.
And in practice,
[INAUDIBLE] you still
may need a fence past this.
Because of the
[INAUDIBLE] of a release.
But it is useful, because you
are releasing a lot of tension,
and so you are improving
your scaling a lot.
Another example is
in the [INAUDIBLE]
which fences should you use.
Because most [INAUDIBLE]
[INAUDIBLE] some heavyweight,
expensive fences, like
heavyweight sync on a power,
or a dmb sy on ARM.
And for release and
acquire, you can [INAUDIBLE]
sync-- things like
lightweight sync and dmb
ish, which are not always done.
But even better, you
can do an optimization
for specific professors.
Because for example,
the Apple people
are told that for their swifter
processors specifically,
dmb ishst is strong
enough for rerelease.
And so this is I
think a good reason
to use atomics instead of
running your own with assembly.
It is that you kind of hope
to know every special tricks
for every possible processor.
But the processor
writers, makers
can teach these traits
to the compiler.
And so for example, LLVN
[INAUDIBLE], it should tell it
I'm am running on
that specific machine.
So another simple
example was a bug
with [INAUDIBLE], which
was using extra registers.
A more friendly on
was on Power, where
you have this horrible
code when you were just
storing a byte to a byte.
So first you had all kinds
of shuffling of the bits.
And then a loop, doing a load
link at a stock condition
around every
iteration of the loop.
So actually what
was going on here,
which is quite funny, that
the back end-- instead of just
saying I have a store,
so I should make a store,
with saying, oh, it
is an atomic store.
I don't know what I
should do with that.
So the [INAUDIBLE] was
helpfully replacing every store
by the look compile exchange.
And then every
compile exchange was
replaced by some load linked
and stock conditional, that
take a look on the
cash line, basically.
And then because these
are only implemented
for words in the power
backend that are well-aligned.
You have all of this
shuffling, which
is trying to align
things in the right way.
And so actually, all of this
could be replaced by just,
I forth 42 into the register
r2, and then I store a byte.
So that was a pretty big win.
However, I'm not always
perfect with atomics,
even if now you can get
some decent performance.
Because there are some other
attributes, such as relaxed.
So the idea behind
relaxed was sometimes
the programmer
really knows best.
So they should be able
to say to the compiler,
I know I have a [INAUDIBLE].
I want it to be well defined,
but don't add fences.
I know what I'm doing.
So for example, this means that
you can print 1-1 in this code.
Because why not?
The processor could reorder
the [INAUDIBLE] loads.
Why not?
Because C++7, [INAUDIBLE] is
defined in terms of traces
of events-- from
it's point of view,
this looks a lot like this.
Where instead of a constant,
you have some dependency
and basically, some
kind of loop where
you are storing some value,
because you loaded that value,
because you had
started that value.
So after this, x equal y,
equal whatever value you want,
which is pretty scary.
But you can make
things even worse.
Because you can
have causal loops,
where you will enter in a
branch, because you enter that
branch.
And so from this thing,
you can build all kinds
of crazy counter examples
to pretty much any kind
of a analysis or reasoning
you may wish to use.
So it's a pretty big question,
how exactly [INAUDIBLE].
And the main thing most fences
would add some small cost
to relaxed.
So they would not
be perfectly free,
as they are currently,
in trying to retreat
this kind of craziness.
Another attribute is consume.
So consume is kind
of a cheap acquire.
So acquire was for
a load, to make sure
that everything
that comes after it,
will see everything that was
before, synchronizing release.
Consume says that everything
that comes after it,
and it depends on the
values that was loaded,
we see everything that came
before the synchronizing
release.
So for example, it will work
here exactly like an acquire,
because we are
reusing the same key.
If it was another
variable without any kind
of the difference, it would
be exactly like a relaxed.
It would not order anything.
So if this was a good idea,
[INAUDIBLE] because it
is really cheap on the
hardware, because [INAUDIBLE],
for example, [INAUDIBLE] fence
for acquire, but not for this.
Because they automatically
track such dependencies.
However, the big issue is
that dependencies are not
defined as cleanly as a C
level, as in the hardware level.
For example, here, do we
have a dependency or not?
And the answer is that you
do, until you run a compiler
that optimizes t minus t is 0.
And this can be really messy,
because instead of just
a single in-line expression,
you have a call to some function
somewhere else, meaning that
to keep the dependancies here,
you would need to prevent
optimization everywhere.
Which is obviously
not a good idea.
So what compilers
currently do is,
they're set whatever within an
acquire we put it in a fence,
exactly like acquire.
But this means that
in many back ends,
we will need a fence
where the in-line assembly
approach would not.
And this is why, for example,
the Linux channel is not yet
using atomics.
Because this kind of thing is
a very common pattern for them,
and they don't want a fence.
And they don't want a compiler
to optimize their dependencies
away.
AUDIENCE: How does a consume
interact with a function call?
How does a consume interact
with a function call?
ROBIN MORISSET: How
does a consume interact
with a function call?
So in theory, you are supposed
to try to dependencies
throughout the function.
And in a practice
Linux [INAUDIBLE]
are using their own
hand-made consume assembly.
This is exactly what they
do when they sometimes
have long strings of functions.
And they are carefully making
sure that the [INAUDIBLE]
and really depend on what
was at the beginning.
And so the main suggestion
for how to fix this,
is add a way for the
programmer to annotate and say,
in this function, I want you to
serve my dependencies, or even
[INAUDIBLE].
I want you to make sure
that this variable depends
on that one at this point.
And so there are lots of
the ideas on how to do that,
but nothing that really
met a consensus yet.
AUDIENCE: There actually
is this [INAUDIBLE]
ROBIN MORISSET: Yes, so--
AUDIENCE: [INAUDIBLE]
AUDIENCE: It doesn't
change the semantics,
but it's a hint that
it might be worth
devoting dependencies
[INAUDIBLE].
ROBIN MORISSET:
Yeah, so the comment
was that there is already a
[INAUDIBLE] dependency in C11,
which has to do
with exactly that.
But it has a value [INAUDIBLE]
on your [INAUDIBLE],
and you can't fully say
exactly what you want.
And so compilers can freely use
it as much as they would like.
So in short, I hope that
I have convinced you
that atomics are much more
portable than in-line assembly
for this kind of lock-free card.
And that they can be well
optimized by compilers,
even if it is somewhat tricky.
But that there are still lots
of open questions to answer,
[INAUDIBLE].
Thank you.
MODERATOR: All
right, thanks Robyn.
So if there are questions,
you can come in front,
or we will repeat the questions.
There's a microphone here.
Yep.
AUDIENCE: So erata
for processors, there
can be lots of erata
for synchronization.
So for [INAUDIBLE], dmb
sy is like the only thing
you can do to basically--
[INAUDIBLE] compiler
optimizations that you've done?
Like do you have to
generate two listings?
ROBIN MORISSET: So the question
was about trickier processors,
which only on some kind
of expensive barriers.
And so currently, yes, we
only apply the optimization
for some specific fences.
And so if, when met with
a more expensive one,
we are basically saying,
OK, don't do anything,
I am conservative.
But it is not an
essential issue.
It is just that no one really
bothered to optimize that case.
MODERATOR: All righ.
Well, that's it.
Thanks for coming.
[APPLAUSE]
