- Good afternoon everybody,
my name's Olivier Giroux,
I am an architect at
NVIDIA, I design GPUs.
But what do I have to do with C++?
Well I'm also on the C++ committee.
I'm the chair of study group one,
which is a concurrency and
parallelism study group.
Okay.
So I chose this title of
High Radix Concurrent C++
because I was thinking about two things.
I was thinking about high
radix data structures,
you know trees with large fan outs,
and I was thinking about
high radix threading models,
groups of threads and
groups of groups of threads
that themselves involve large numbers.
Groups of, 256 groups of 256 threads
is a perfectly normal thing in my world.
So I was thinking about
these two things together and
I wanted to show you something
that I think would surprise you.
Alright so, the big spoiler is
we're gonna talk about GPUs.
You're not surprised.
Okay, so last year I gave a
talk about the design of VOLTA.
VOLTA was our big
supercomputing architecture
that we launched last year,
a data center architecture that we designed with C++
in mind from the get-go.
There are large parts of the design
that we completely replaced
and as we were designing
the replacements we
would occasionally quote
the C++ standard at one another.
So that's how much of an impact our thinking about C++ had.
Okay so we refer to our
architectures for compute
as having a capability level.
Here I'm saying that
VOLTA has a compute 7.0,
compute_70, capability
level and that introduced
a number of cool things
that you may or may not
have heard about.
So the first one I'm pretty sure you've heard about,
and that's our fused MMA operations.
It used to be that fused FMA was a big thing,
and your code thanked you for FMA.
Nowadays we're really looking at fusing
matrix multiply operations to accelerate
matrix multiply much more.
So that was definitely
the biggest headliner
over our launch for VOLTA,
was our 4x4x4 matrix multiply hardware.
But there were other cool things in there
that are particularly of
interest to C++ developers.
And so two examples of that
is that we added support
for starvation free algorithms
which are gonna make a
showing in this talk.
And we redesigned the
memory model for C++ 11.
Okay but last year that may have felt
pretty distant and
academic for a lot of you.
Volta GPUs, although they're available
to the mass market,
they're fairly expensive.
And a lot of them went to data centers
or supercomputers, for
example this is a picture
of Summit the world's
fastest supercomputer,
which is built on VOLTA.
There was big news coming this year,
that I've known all along, obviously, was coming this year.
The big news is that this year we followed up on VOLTA
with a new architecture called Turing.
You have probably heard,
possibly heard, about Turing.
You've possibly heard that Turing includes
ray tracing acceleration,
which is super cool and which I am stoked about.
But there's something
you may not have heard.
What you may not have heard, actually,
is that Turing includes all of the
improvements from VOLTA as well.
So Turing is compute_75 and compute_75
is a strict superset of compute_70.
Now that starts to be a really big deal
for all GPU developers, for all C++ developers
because a lot of you will
come to own one of these.
Like you might buy one
because you're building
a gaming rig or eventually down the line
they'll be available in many more systems.
It's possible that you'll
just come to own one of these.
It's great at C++, you're a C++ developer,
you should run C++ on this.
Alright so I'm going to
tell you what you can expect
and how much fun that's
going to be actually.
Okay.
So I was thinking about this
much earlier in the year.
I knew Turing was about to
hit shelves around CppCon.
And so I was thinking, what would make
a completely unexpected demo?
Something I could show
you, and you'd be like,
wow I didn't expect
GPUs to be good at this.
So, what do people think GPUs are good for?
By the way, this is me making a guess
as to what people think, and talking to people
to know what they think GPUs are good for.
I think GPUs are good for many more things.
Alright so what do people
think GPUs are good for?
When people think GPUs they think floats:
little floats, big floats, double-doubles,
you know, double-double for scientific computation,
complex doubles.
People are thinking arrays,
like they're thinking,
I know when GPGPU came
out you would allocate
large buffers and you would
move these large buffers
in coarse grained data movements.
So I'm thinking floating point data,
I'm thinking large arrays, I'm thinking
highly regular memory
access, adjacent threads
accessing adjacent memory locations.
And I'm thinking, I'm thinking,
when I put my C++ SG1 hat on, I'm thinking
lock free algorithms.
I can't actually write arbitrary synchronizing code
on GPUs, because if I write a lock, for example,
and you can search for that on the web,
I might expect GPUs to have problems with that.
What do people think GPUs are bad for?
Well have you ever heard of anybody
doing string processing on a GPU?
Probably not.
What about node-based data structures?
So
trees, large trees.
What about random memory access?
So I go to memory and it's just a total scattershot,
there's no going for adjacent locations.
Do you have a question?
- [Audience Member] No I
was merely hand talking.
- Oh you thought of strings?
- [Audience Member] Yeah all of those.
- Oh good.
- [Audience Member] We're
three for three so far.
- Good, good, good, how about
starvation-free algorithms?
Algorithms that, for example,
poll on a memory location,
potentially with contention with other threads,
and you're waiting for a
thread to set this location,
but you don't really
know which other thread
is setting this location.
It might be in the same group as you
or in a different group.
I think most people think
GPUs are bad for this,
thanks to the peanut gallery.
(audience laughs)
So I'm thinking to myself
let's build a trie,
let's go build a trie,
alright let's do that.
So what's a trie?
So a trie is a node-based data structure.
And I most often refer
to it as a radix tree.
So it's a tree with a large fan out,
not a binary search tree,
that encodes,
that is essentially a map where the keys
are not stored in the tree.
The keys are the location in the tree.
So for example, if I were to insert,
if I wanted to make a map
of strings to something
I would have a root node,
which is an empty string,
then I would have a first level of nodes,
which is indexed by the alphabet on the first letter
of a word, and then after that there'd be another node
indexed by the second letter of the word,
and then another node by the third letter
of the word, and so on.
So radix 26, that's high!
Hence I got a talk title.
And like I just said you can
implement a map, conceptually,
a map of strings to something
using this data structure.
And in this demo, in
this talk, that something
is gonna be a count.
We're going to compute a word
frequency over a text corpus
by inserting into a trie data structure
and incrementing the count as we find,
or insert nodes into the trie.
Okay.
So this is sort of an interview
question type of problem.
What might our code look like?
So here's a summary of
what my data structure
will look like, and how I index it,
and after this we're gonna
look at the algorithm.
This isn't supposed to
be a trick question,
it's not supposed to be complicated,
this is just sequentially building a trie.
So here I've got one
type of structure, trie.
Each node of a trie is itself a trie.
I have 26 down-level references
that are each a pointer to another trie.
And each node gets one of these counts.
So this helper function
here is going to take
some character, not a Unicode
character in this case,
but it's gonna take some
character and it will compute
the index
into the fan out of the tree.
And I'll return minus one when it's not
an alphabetical character.
It's not very complicated, I
am not looking to make a thing
of beauty of this, it's just
a simple version of this.
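To make this concrete, here's a minimal sketch of the node type and the index helper as just described; the names are illustrative, and the talk's actual sources are on GitHub.

```cpp
struct trie {
    trie* next[26] = {};   // 26 down-level references, one per letter
    int   count    = 0;    // the word-frequency count stored at this node
};

// Map a character onto the fan-out of the tree, or -1 if it's not a letter.
int index_of(char c) {
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return c - 'A';
    return -1;
}
```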
Okay how do we make a trie?
Well this is going to be sort of the heart
of my code base here, the make_trie function
takes a trie to insert into.
For this example it's
going to start out empty,
but actually for the
algorithm it doesn't need
to start out empty, you could
think of it as append to trie.
I'm going to make a bump
allocator, which is a pointer
into some preallocated
range of trie nodes.
I'm taking allocation out of this problem;
we're only gonna look at the main algorithm.
So that's gonna be my bump allocator,
a pointer into a range of free nodes,
and then begin and end
into a character sequence
which is going to be the text corpus
that we'll find words from.
Okay.
The algorithm itself is not super complicated:
basically we start at the root node and we walk characters,
and for each character we pick a sub-node
of the node we're looking at.
We're going to keep going like this
as long as we don't find a space
or a non-alphabetical character,
where the index would be minus one.
At that point we say,
wherever we are in the tree,
we increment the count
there and reset to the root.
If there is no node where we're looking,
then it'll be null and we'll just bump
the bump allocator and assign that pointer
to that place in the tree.
Okay.
Not too complicated.
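Here's a sketch of that sequential loop, under the same assumptions (hypothetical names, the bump allocator passed in as a raw pointer into a preallocated range):

```cpp
void make_trie(trie& root, trie*& bump, const char* begin, const char* end) {
    trie* n = &root;
    for (const char* c = begin; c != end; ++c) {
        int const idx = index_of(*c);
        if (idx == -1) {                       // non-letter: the word (if any) ends
            if (n != &root) { n->count++; n = &root; }
            continue;
        }
        if (n->next[idx] == nullptr)           // no child yet: bump-allocate one
            n->next[idx] = bump++;
        n = n->next[idx];                      // descend one level
    }
    if (n != &root) n->count++;                // count a trailing word, if any
}
```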
I wanna point out that all
of the code in this talk
is going to be available
on GitHub at the end,
so don't worry about it,
you can check it out after.
Okay so let's go run
this and this is the part
where we attempt live demos.
(audience laughs)
So I have three systems
here that are standing by
to run this demo for me.
So I've compiled, in the
sources that are available
with this talk, I've compiled trie_st,
the trie single threaded,
which is exactly the
code I just showed you.
And I can run this over a number of books
that I got from Project Gutenberg;
these text files are just what you get
when you download books from Project Gutenberg.
So here I just did this
on my laptop right now.
I have another system over here, run this,
it's a lower clocked system
but it has many more cores
and that's gonna become
important in the next demo.
Okay.
Alright, so on the system that has many more cores,
which we're going to use for the other demos,
whose specs are down here,
it takes 68 milliseconds to do this.
One thread.
Okay well,
obviously we're gonna multi thread this,
we're gonna go to the SG1 version of this.
Okay, so we are going to convert our sequential version
into a concurrent version.
There's a couple of
things that we have to do.
One of the first things that we have to do
is realize that we're now going to be
concurrently accessing the trie.
And actually there's no member of the trie
that we will not access from more than
one thread, or with interlocked operations.
So we're actually going
to make that count atomic
and we're going to make
the pointers atomic.
And we're going to
eliminate insertion races
by using an atomic flag.
And the algorithm is going
to be on the next slide.
There's two more things
we need to do, we need to
realize that the bump allocator
will also be used concurrently,
and so we're going to make
that one an atomic as well.
And then finally there's
something pretty fundamental
about making the multi threaded
version, where I'm going to invoke this make_trie
function in multiple threads.
They need to know which thread they are.
So each one is going to receive an index
that says which thread
ID it is within domain,
which is the total number
of threads that are
gonna be running this algorithm.
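A sketch of what those changes might look like, assuming one flag per child slot; the talk's exact layout may differ.

```cpp
#include <atomic>

struct trie {
    struct child {
        std::atomic<trie*> ptr;    // down-level pointer, now atomic
        std::atomic_flag   flag;   // elects the one thread that inserts
    } next[26];
    std::atomic<int> count;        // concurrently incremented word count
};

// Each invocation also receives its thread index within the domain
// (the total number of threads running the algorithm).
void make_trie(trie& root, std::atomic<trie*>& bump,
               const char* begin, const char* end,
               unsigned index, unsigned domain);
```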
Okay this part so far is pretty simple.
There's more complexity in how we go about
implementing the algorithm.
So we're gonna do it in three easy steps.
First step is there's gonna
be a part of our program
that is going to split the input range
into a bunch of strips.
We're going to do this
relatively crudely, let's say.
We're gonna take the whole
input, the whole text corpus,
and we're gonna slice it
into domain number of slices
and then each thread will go access
the index numbered slice within
the domain number of slices.
Okay.
So we're gonna do that,
but there's a trick when you do this.
Now the boundaries of
these slices for the input,
actually some of them fall
into the middle of a word.
So what we're going to do is
either as a pre-adjustment step
or as part of how the algorithm runs,
we're going to adjust each strip's start and end
to start at the next word
and end at the next word,
with treatment of the boundary conditions
where you start at the
beginning and end at the end.
Okay, not too complicated.
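A sketch of that strip selection, using the index_of helper from the earlier sketch; the adjustment rule here is one plausible reading of the description above.

```cpp
#include <cstddef>

// Slice the corpus into `domain` strips, pick the `index`-th one, then nudge
// both ends forward to the next word boundary so no word is counted twice.
void select_strip(const char* input, std::size_t size,
                  unsigned index, unsigned domain,
                  const char*& begin, const char*& end) {
    begin = input + size * index / domain;           // crude even split
    end   = input + size * (index + 1) / domain;
    if (index != 0)                                  // first strip keeps its start
        while (begin != input + size && index_of(*begin) != -1) ++begin;
    if (index != domain - 1)                         // last strip keeps its end
        while (end != input + size && index_of(*end) != -1) ++end;
}
```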
Alright, finally, we can run the same code
as the sequential version,
except that insertions now are concurrent,
which is why we made the members
of the trie structure atomic.
And so we're gonna run code like this,
okay so this is just replacing
the node insertion part.
We're going to check
if the pointer is null,
and if it's not then we're good to go,
we can just use that pointer,
just read it at the bottom.
If it is null when we get there,
then one of the threads has to be elected
to go do the insertion.
And that's the test_and_set step that is right here.
Okay, so this test_and_set returns whether you're
the first thread to arrive to insert the node,
and that tells you to go do it.
That sends you over here,
so that's on the false.
If the previous value of the flag is false,
then you take this branch, and you're the one
who's the first one.
And then you do essentially the same thing
as on the other slide, except it's atomic now.
We used to increment and then assign,
so we do this again: we fetch_add and then we store.
Okay.
Now when you store to the pointer, that will tell
the other threads, the ones that lost the test_and_set
and went into a spin loop waiting
for it to become non-null,
that permits those threads to make progress now.
This is where this becomes
not a lock-free algorithm
but a starvation-free algorithm.
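A sketch of that insertion step, using the atomic members from the earlier struct sketch (not the slide's exact code); n is the current node, idx the character's slot, and bump the shared allocator.

```cpp
trie* next = n->next[idx].ptr.load(std::memory_order_acquire);
if (next == nullptr) {
    // test_and_set returns false only to the first thread to arrive.
    if (!n->next[idx].flag.test_and_set(std::memory_order_relaxed)) {
        next = bump.fetch_add(1, std::memory_order_relaxed);     // take a fresh node
        n->next[idx].ptr.store(next, std::memory_order_release); // publish it
    } else {
        // The losers spin until the winner publishes the pointer; this is the
        // spin loop that makes the algorithm starvation-free rather than lock-free.
        do {
            next = n->next[idx].ptr.load(std::memory_order_acquire);
        } while (next == nullptr);
    }
}
n = next;   // descend into the (possibly just-created) child
```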
And that's pretty important because
it was very natural for
me to make this step here.
It was very easy for me to deal with the
racy insertions into the trie
by using the spin lock approach.
The spin lock approach
is not a major concern
because we know that, first
of all, the odds of reaching
the spin lock are very
low and then when you
reach the spin lock you know
for sure there's another thread that's going to set this
pointer to non-null very soon.
So it's not really a
question of performance
or a question of system load,
I'm not very concerned about
the presence of the
spin lock on that basis.
But,
I am concerned, both as a GPU architect
and as a member of SG1,
I am concerned that now
this algorithm is fundamentally different
from other algorithms I
could have chosen here.
Now it's a starvation free algorithm,
it requires a benevolent scheduler
for this application to make progress.
And you can learn all about that
by searching on the nature of forward progress.
Okay.
We're gonna talk about this
again, a little bit later,
in the context of GPUs.
Okay so let's just go and run this.
Well I can run this on
my laptop right here.
There's a couple of
things that just happened,
no my laptop does not have 40 cores,
we're logged in somewhere now.
So I can run this on the same system
where I ran the single threaded version.
There's two things that you can notice.
Well one is that actually if we run the
multi threaded version in
single thread we slowed down,
and that's to be expected.
It's to be expected for a few reasons.
One, the code generation is not as good
as the original code generation, because now
we have atomics, and we have releases and acquires,
and there are fewer optimizations
that the compiler can do.
It still can do many, but not all.
And in this example here,
actually, we did spin a thread.
We spun up one and then we joined it,
but we did spin up a thread.
So we slowed down in single
threaded execution mode
but we net a speed up,
so that's pretty cool.
That's pretty cool and the results are not
completely repeatable, you
can see that it bounces around
and I guess here we could
try it on my laptop.
So interestingly my
laptop and that machine
sort of solved the problem
in about the same time.
(audience laughs)
Well, I mean, no, but there's a reason.
I mean, that other machine is a large Xeon,
so of course it's clocked lower;
of course it's got a beefier memory system,
but it's clocked lower.
My laptop here probably just boosted
to its max clocks for a short amount of time.
Okay.
So in some runs,
the stability of the runs is a little bit annoying,
but in one run I got these numbers.
And we did net a speedup,
you saw the other numbers,
so you could see what
the nature of the speedup
looks like.
So we did net a speedup,
I mean we didn't quite
strong scale here but we got
a speedup so that's good.
I feel like the small amount of effort
I put into this paid off.
Okay.
Cool, cool.
Alright, so we checked,
we checked tens of threads.
How do I get to 100,000 threads?
What do I have to do
to
run at that scale?
Well yes,
yes you answered that question,
GPU is the answer, yes.
Okay but concretely in
code what do I have to do?
In the previous couple of
slides I showed you some code.
Yes we could have spent
more time looking at it
but I need to press on.
So what's the CUDA code here?
Do you all know CUDA
right now, I don't know?
Oh well that guy does, that guy does.
Oh no you don't need to all tell me.
(audience laughs)
Many of you know CUDA, but
I'm up here and I'm thinking
oh actually, you know
it's really not a given.
So here's a crash course,
here's a crash course of it.
You, right now, are going to take a 10 or 15
minute crash course in CUDA.
Are you ready?
Let's do it!
Alright so what's CUDA?
CUDA is an integrated, heterogeneous,
parallel programming system for C++.
It's really important to
know that it is not C++,
it's a programming
system that includes C++
as one of the elements.
And it certainly is the central element
that it is built around.
What else is important to know?
Well I said heterogeneous, well that means
that it accounts for both the host's CPU
and the accelerator, the device.
And what do I mean by integrated?
I mean that you write one source code.
You don't write a host program
and an accelerator program
and then pipe the inputs
and outputs from one to the other.
That's not how CUDA works.
In CUDA you write essentially one code base;
it's all in C++ and all the functionality interacts
in a tightly integrated way.
Okay so what's the programming
abstraction to this?
What does CUDA look and
feel like, in short?
So the model is that
there are some functions
in your program, that you're going to call
with very large numbers of threads.
And when I say practically unlimited,
what I mean is that when
you launch threads on CUDA
you will give it sort of an index space,
that could be one
dimensional, two dimensional,
or three dimensional.
And in the case of three dimensional,
it's like a 64-bit, three-dimensional index space.
So the number of threads
that you can express
is extremely large.
Obviously,
the machine cannot run a
64 bit number of threads.
You can't do that.
But you can, for all practical purposes,
map the index space of your problem to CUDA.
You do not need to be
in charge of dicing up
the different threads and
manually managing the threads.
If you have a problem
and your problem measures
100,000 by 10,000 you
can express that to CUDA
and then CUDA will launch the tasks.
And they run for the amount
of time that it takes
for the computation to finish.
And obviously, with GPUs being used in HPC,
some of these computations can
take a long time to finish.
So select functions,
not all the functions.
Select functions,
that you select, that can be called
by practically unlimited
numbers of threads,
in groups of up to 1024, you
decide how they're grouped,
but the maximum group size is
1024 with some constraints.
Which I'm gonna replace with details in a moment:
with some constraints, and if your program stays
within the constraints, then your program will
see consistent memory at all times.
So what are those constraints?
So in CUDA 1.0, which is the model
that a lot of people initially heard about,
you cudaMalloc, and then you get a different,
separate pointer that the CPU can't touch,
and you cudaMemcpy, and you have to memcpy in
and memcpy out, and all that.
That's back in CUDA 1.0, but
from a model perspective it's parallelism,
which is the absence of
interdependencies between tasks.
So you express a number of tasks
and you vouch that they are independent,
'cause that's parallelism.
And asynchrony, where you express these tasks
and not only are they independent of one another,
but any dependency on these tasks
needs to be expressed explicitly.
So therefore they can begin executing
concurrently with your CPU thread.
That's between groups.
And then within a group, which can be up to 1024 threads,
you get a guarantee of concurrency;
that is, these threads in fact can interdepend,
the threads within a group can
communicate with each other.
So you launch groups that are tightly knit
and can communicate.
And these groups between themselves
are loosely knit and they cannot communicate.
So that's the CUDA 1.0 model.
The CUDA 10.0 model, 10 years later basically,
is that model from 1.0,
but we've also added the ability for these
groups of threads to loosely interdepend as well.
So we got a lot of mileage out of this model
because we didn't completely change it.
We added, we clarified, that they can also
communicate, with a clarified progress guarantee.
So why parallelism?
Well, it's because you get fewer bugs
and higher performance just by default.
Of course, if you vouched that
your tasks are independent but it wasn't quite true,
and you wish that they would communicate,
then you have to do a lot more work, of course.
But as a baseline you get fewer
bugs and higher performance.
And then why did we
reintroduce concurrency?
Well it's because actually
if you get your hands a little bit dirty
then you can get higher performance,
and so just like C++,
CUDA C++ doesn't judge
if you mix paradigms.
(audience laughs)
It's parallelism and
asynchrony and concurrency
and consistency and the
progress, all in some measure,
all together.
Now one thing I want to
emphasize is that CUDA C++
is always a superset of C++.
It's always a superset of C++ because
the host CPU never loses anything.
When you go to CUDA you never
need to give up anything.
All of your CPU code
still can do everything
your CPU code used to do.
It has access to all
the libraries it wants,
it has access to all the
features of the language.
Some of the features then also
extend to other processors
and this is sort of how
things are divided up.
So there are things only
the host processor can do.
The things on the left.
Throw, and anything to do with exceptions,
anything to do with RTTI, and TLS.
There are things that
both processors can do
but the processors need
to do it in isolation.
What I mean by that is you
can make polymorphic objects,
you can construct them on the CPU,
and the CPU can use those.
And you can construct them on the GPU
and the GPU use those, but you can't pass
a polymorphic object
from the CPU to the GPU.
And all three of these are essentially because
there's some notion of there being
a pointer to code encoded in these things.
And obviously the CPU
can't execute the GPU code,
and the GPU can't execute
the CPU code, so beware.
And then finally in the
third column you have
what all processors can use,
and they can use together
at the same time.
That is the CPU can construct an object
with certain features and
just pass it to the GPU.
And the GPU will understand
what that object is
and implement those features faithfully.
And that's basically the rest of C++.
And then the finer-point details are at this link.
There are nine extensions
we're going to need,
nine things that are not in
C++, that we're going to need
to be able to read the demo code today.
Okay, nine things.
Those nine things are,
we're going to have some
memory management APIs,
they're a lot simpler.
If you dabbled in CUDA before,
these were added to CUDA,
and these are much simpler
than what we used to have.
Okay so cudaMallocManaged and cudaFree,
they'll implement a symmetric heap.
What I mean by that is you
allocate memory into that heap
and the CUDA driver will make sure
that the GPU and CPU have a consistent view of it.
Without you doing anything else special,
you don't need to do memcpys,
you don't need to do cudaMemcpys with that;
you just do what you do on the CPU,
you do what you do on the GPU, and it works.
We'll use cudaMemset
because actually cudaMemset
can zero memory like a hero.
That's because there's fixed-function hardware
on GPUs that's exceptionally good
at copying memory, yes, but we're
not gonna use that feature;
it's also exceptionally good at zeroing memory.
So we're actually gonna
use this in this demo.
There are some decorations
we're gonna need to put
on our functions and I'm gonna show you
what to do with that in the next slide.
CudaDeviceSynchronize is how you resolve
the asynchrony of executing CUDA threads.
I said that in CUDA you
can invoke functions
in many threads, asynchronously,
well the cudaDeviceSynchronize
is the function you call
when you need to get the result.
There's the infamous triple chevron call:
when we call a function
and put the triple brackets,
that's going to call that function on the GPU
with that many blocks and that many threads.
And then here we're gonna
have our indices which
play the same role as in our
CPU example,
we had index and domain, these
will help us work this out.
Okay so here's an eye chart and we're gonna
dice it into little pieces, one at a time.
Okay so host device is the decoration
you put on ordinary functions.
There is a lot of code in the world,
a lot of my code, a lot of your code,
just a lot of code in the world that can
just be annotated with host device.
It just has to, not use
our TTI, and not throw,
and carefully use function pointers.
Assuming that you meet these
that can be host device.
So this is my ordinary function
that does nothing interesting.
Entry points into the GPU
have to be marked with global.
Those are just entry points though.
Global functions can't
call global functions.
The only thing that can call
a global function is the CPU
and it can only call it
with triple brackets.
So our entry point is gonna
be marked with global.
There is a gotcha you gotta be aware of,
you guys are sophisticated C++ people,
you need to know that
when you pass operands
to a global function they are assumed
to be trivially copyable,
whether it's true or not.
So be careful, alright?
We're gonna use the
manage heap in this demo.
So in this demo what I did
is I made an allocator,
this is just an allocator, and basically
there's only two lines of
it that are interesting.
CudaFree and cudaMallocManaged.
The rest is just filling out
the allocator boilerplate
and that eliminates a lot of tedium.
We can use triple bracket
on a global function,
in fact we must use triple
bracket on a global function.
Here I'm just gonna
launch into one thread.
Calls to triple bracket are asynchronous;
we need to call device synchronize
to know when they're done.
Because my ordinary function
is marked host device,
the host can call it too.
And so I can, for example,
just call ordinary function
on the CPU side.
That can be really
convenient, different people
have different workflows,
my personal workflow
is I tend to write all of
my solver, or my model,
or whatever it is I'm working
on in host device code.
Then I get a lot of mileage out of
running it on the CPU for a while,
and then I take it to the GPU,
knowing that it's already kind of pretty good.
And then I do that next step
of finishing my GPU tuning,
for example, that way.
Just a way to work, it's up to you.
Okay so that's the cheat sheet,
using each of the nine things once.
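Here is a minimal sketch that exercises these nine things, in the spirit of the cheat sheet; it's compiled with nvcc, and the toy kernel and buffer are my own, not the slide's.

```cuda
#include <cstdio>

__host__ __device__ int square(int x) { return x * x; }   // callable on both sides

__global__ void fill(int* out) {                           // a GPU entry point
    int i = blockIdx.x * blockDim.x + threadIdx.x;         // my index in the launch
    out[i] = square(i);
}

int main() {
    int* out = nullptr;
    cudaMallocManaged(&out, 256 * sizeof(int));   // symmetric (managed) heap
    cudaMemset(out, 0, 256 * sizeof(int));        // zero it like a hero
    fill<<<2, 128>>>(out);                        // 2 blocks of 128 threads, async
    cudaDeviceSynchronize();                      // wait for the launch to finish
    std::printf("%d %d\n", out[255], square(3));  // host reads managed memory,
    cudaFree(out);                                // and calls the same function
}
```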
Okay but that's not
actually a trie solver,
that's not what it does,
it's just the cheat sheet.
We can take a look at CUDA on Godbolt.
For example if I put
my cheat sheet program
into the compiler explorer,
I get access to CUDA now,
on the, on the (blows raspberry)
compiler explorer, yes it starts with C.
And here I have it running with NVCC 9.2,
I can also run it with clang.
And what I see here, as the assembly,
is NVIDIA's PTX assembly.
So you can take CUDA code and put it on
the compiler explorer and take a look
at what the generated
NVIDIA intermediate assembly
looks like.
This assembly is the same one that's documented
in the PTX manual available on our website;
you can download it, and special shout-out to the
memory consistency model, which I love.
(audience laughs)
You'll need coffee to read it though.
Okay so, alright now that we've
done this we know CUDA now!
Cool.
Now that we know CUDA
what's the trie demo,
what does the trie demo look like now?
I did all the decorations, so am I good?
You know there's more decorations
but this is just the algorithm.
Am I good?
Well actually I have some
more details to give you.
The further detail is that I didn't tell you
what happened with the library.
So if I revise my previous slide,
what I should have pointed out
is that everything in std::
can be used by the CPU alone
and what can be used by
both processors all the time
is the rest of the C++ language.
What we need for this demo is access
to freestanding library.
The standard recognizes
two subsets of C++.
One is hosted, which means everything,
the subset being everything.
And the second subset is freestanding,
which is an implementation-defined set of headers,
but at least the ones in Table 16.
And the ones in Table 16 are very basic things,
but that includes atomic, which is what we need.
So what we're going to do
to run this demo on the GPU
is that we're going to use
my freestanding library.
And I put it on GitHub,
and put it live today,
you can try it out.
You should read the readme though.
Alright so now we have a
freestanding library here,
which all processors can use.
And it puts standard
definitions of symbols in
this simt::std:: namespace.
Alright so now we're gonna
look at our CUDA C++ version.
What do we do?
Well we do the nine things,
we do the decorations,
so we put host device on our solver,
and we do the other things
with the managed allocator
to allocate the memory and whatnot.
We convert our std::atomics
to simt::std::atomics
and also that one,
and we're done.
Now we've ported it.
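A sketch of what the port might look like; the simt::std:: spellings follow the library's namespace described above, but the member layout and the kernel here are illustrative, not the talk's exact code.

```cuda
struct trie {
    struct child {
        simt::std::atomic<trie*> ptr;   // was std::atomic<trie*>
        simt::std::atomic_flag   flag;
    } next[26];
    simt::std::atomic<int> count;
};

// Same body as the CPU version, now callable from both processors.
__host__ __device__
void make_trie(trie& root, simt::std::atomic<trie*>& bump,
               const char* input, size_t size,
               unsigned index, unsigned domain);

__global__ void make_trie_kernel(trie* root, simt::std::atomic<trie*>* bump,
                                 const char* input, size_t size) {
    unsigned index  = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned domain = gridDim.x * blockDim.x;
    make_trie(*root, *bump, input, size, index, domain);  // picks its own strip
}
```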
Okay so let's go run that and obviously
I cannot run that on my laptop.
But we can run that on
this system over here
and that will boot the CUDA driver
and this will run, thank you.
And boom there we go.
Well that's pretty cool.
That app now came up, it ran with 163,000 threads,
and it ran on this system, when cold,
about five times faster.
And when the system was warm, about 10 times faster
than what the 40 threads were
able to do on the Xeon before.
And we can also do the same thing on
a different system.
So the other one was a big
beefy server type of machine
this one here is just my
desktop at the office.
Works just as well.
Alright so let's get back to the show.
Alright so I averaged all of the run numbers,
so let's say 3 milliseconds,
that's pretty good.
That was pretty low effort, once we got
a freestanding library,
that was pretty low effort.
(audience laughs)
Yeah.
Right.
Alright so 100,000 threads, check, okay.
But what just happened?
I thought GPUs were terrible at this.
Okay so a little bit more
details about the performance.
Actually the computation
took even less time
that you think.
It took about 650 microseconds
for the GPU to build the trie.
And then the rest is
the overhead of getting
the data over there and
back over the PCI bus.
And some control overhead,
the ratio of overhead to
computation gets worse
with smaller problem sizes and gets better
with larger problem sizes.
Commonly, for example, if you're doing
a lot of processing, you would
overlap transfers with computation.
So it amortizes out to just
paying the 650 microseconds.
But again it was supposed to be terrible,
why wasn't it terrible?
Well the reason is this
application's probably
memory-latency bound,
pretty much everywhere
it runs on CPUs.
It's chasing over memory,
it's being pretty chaotic,
it's not very predictable.
But if you throw 100,000
threads at a problem
the exposed latency
pretty much disappears.
And at that point when the
exposed latency disappears
you just become bandwidth bound,
and there's a lot of bandwidth on GPUs.
The memory bandwidth of the GPU we ran on
is something like 900
gigabytes per second.
A very expensive CPU will get you,
I suppose, right now, something like
50 or 60 gigabytes per second.
It's a lot more memory bandwidth.
Now can any GPU run this demo?
No, they cannot.
And that's actually kind of why I'm here;
that's what's cool about the
launch of the new Turing GPUs.
Is that there's only a few
GPUs that can run this.
There's our large data center VOLTA chips,
there's our embedded AI
driving platform Xavier
that can run this, but it's not available
in your hands yet.
And Turing, pretty cool.
So only these will run this because
we have a starvation free algorithm.
And if you were to take
this algorithm and run it
on another GPU then what
would happen is it would hang.
Now maybe we could dig ourselves out of this hole
by throwing more complexity at it,
by imagining a different algorithm
that is potentially lock-free and all that.
But you know what, that
would be more effort.
And a big part of my point here is that
actually that was fairly straightforward,
with a freestanding library.
Okay so pretty quickly, I wanted to tell you guys
something about C++20,
because that's one of the things I'm trying
to deliver to you guys, with the rest of SG1,
and that would be great.
Okay so there's this
part which wasn't great,
we did have a spin loop in there.
Again I explained that that
spin loop is probably okay
but in general, good polling
groups are very difficult
to write and so we end up with
only terrible polling loops
in everyone's code bases.
So in C++ 20 we have a new facility
that's going to make this much easier.
We're going to have
essentially an abstraction
for polling loops called atomic_wait.
And here is atomic_wait_explicit
because it takes a memory order.
So it replaces the polling loop
with this polling abstraction.
It is paired with a notify,
which is somewhat similar
in concept to waiting
on a condition variable
and notifying a condition variable.
And the reason that we
need a notify is because
the idea is that this is going
to probably be implemented
using, say on Linux, futex,
ultimately.
Initially polling a
bit and then eventually
maybe yielding or sleeping a bit,
and then after you've failed for a while
you should probably come to full rest
in the kernel, so you call futex.
But if inside of the
atomic_wait there's a path
that leads to futex then
you also need a futex wake.
And that's what's going to go
into the atomic_notify_all.
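A sketch of what the earlier spin loop could become with these C++20 functions (the surrounding names are from the earlier sketches):

```cpp
// The winner publishes the pointer, then notifies:
n->next[idx].ptr.store(fresh, std::memory_order_release);
std::atomic_notify_all(&n->next[idx].ptr);

// The losers wait instead of hand-rolled polling; atomic_wait returns only
// after observing a value different from the one passed in (nullptr here):
trie* next = n->next[idx].ptr.load(std::memory_order_acquire);
while (next == nullptr) {
    std::atomic_wait_explicit(&n->next[idx].ptr, nullptr,
                              std::memory_order_acquire);
    next = n->next[idx].ptr.load(std::memory_order_acquire);
}
```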
Okay so that's one thing.
There's a number of other things,
I probably don't really
have time talk about them
in great detail, but
I'll tell you one thing
about each one.
We're going to have an auto-joining thread
with interruptible waiting, that's cool.
We have the waiting functions
that I just talked about.
We'll finally have semaphores, finally.
We'll have latches and barriers which are
cooperative synchronization primitives.
Latches are one time use,
fan in from multiple threads.
Barriers are multiple
use fan in and fan out
between multiple threads.
We have atomic_ref which
is going to allow you to do
atomic operations on non-atomic variables.
One of the cool things about
that is it will actually help
port code that has
crusty old C code in it.
That just has volatile int pointers or something.
It's dirty, I don't like it,
I don't like volatile int pointers for synchronization.
It's in people's code, it's a reality;
atomic_ref actually will make these code bases
easier to bring into the modern present.
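A small sketch of atomic_ref on a plain variable, as described:

```cpp
#include <atomic>

int legacy_counter = 0;   // a plain int in some crusty old code

void bump() {
    // View the plain int atomically for this operation, without changing its type.
    std::atomic_ref<int>(legacy_counter).fetch_add(1, std::memory_order_relaxed);
}
```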
Unseq is, we're going to have a simple
and effective vectorization policy.
We have a bunch of improvements
to atomic_flag and whatnot.
And we have some repairs
to the memory model,
that are in flight.
Alright, so to wrap this up,
Turings are shipping by now.
You, of all people, should try them.
You should try C++ on them.
And,
if you try them, we'd like to
hear back from you definitely.
And today's code is
available at this link.
And that's it,
I think we have a little
bit of time for questions.
(audience applauds)
Thank you.
We'll stretch into the
next five minutes I guess,
the next talk probably starts at 15.
Hi Billy.
- [Billy] Hi in your
atomic_notify_all example.
- [Olivier] Yes.
- [Billy] Do you need to be worried about
spurious waits on the spinning side?
- [Olivier] It's perfectly valid for the spinning side
to experience spurious waits.
In fact it doesn't promise to go to sleep
and it doesn't promise to not go to sleep.
So it--
- [Billy] Well what does
that do to your trie demo?
- [Olivier] It does nothing
bad to my trie demo.
The atomic_wait does not return early.
So it can wake internally
but it needs to retest
the memory location and
if the memory location
has not been updated it
needs to sleep again.
- [Billy] I see thank you.
- [Olivier] Yeah.
- [Audience Member] We
actually have 15 minutes.
- [Olivier] We have 15?
- [Audience Member] Of the session left.
- [Olivier] Okay.
- [Audience Member] Your
hour started at 15 after.
- [Olivier] Oh, yeah, thank you.
- [Man In Gray Sweatshirt] Hi.
- [Olivier] Hi.
- [Man In Gray Sweatshirt]
I wanna ask about like,
CUDA's model about the
rescale synchronization
between the GPU and the
CPU side, cause I know like
the xlib r and rs and things before OpenGL
they kinda have like a
big overhead between the
state machine synchronization
and between the GPU and
CPU, and that kinda thing.
- [Olivier] Right, so there's,
there are a lot of different reasons why
various versions of DirectX in the past
experienced high overhead.
Like, for example, there were many generations where
almost every call was a kernel call
and not just a user-mode dispatch.
There is quite a lot
in CUDA that is simply
done in user mode, and doesn't
need a kernel transition.
So the efficiency, it approaches basically
just the efficiency of signaling over the PCI bus,
and then the dominant factor in how expensive
it is for you to go to the GPU and back,
is really how much data
you're referencing.
So if you need to reference a lot of data
on both sides, then that data has to move.
Ideally what would happen is you would
move your data to the GPU
and try to keep it there
as long as possible.
And if you have multiple computations
you try to keep your data there,
even if the control comes back and forth,
you try to keep the data there
so it doesn't take time on the PCI bus.
- [Man In Gray Sweatshirt] So that's the
gap that like scopes
in which control rate,
let's say this went on the GPU end,
let's say we're not CPU.
- [Olivier] Right.
- [Man In Gray Sweatshirt] Like a batch-ish kind of feel, right?
- [Olivier] So you're in
control of how the code
moves back and forth, definitely.
If you use the managed memory allocator
then CUDA is in control
of how the data is moved.
But you can, if we were
to get into a tuning
discussion there's a lot of hints and knobs
you can use
to say, well this data
everyone can touch but
usually the GPU touches it.
And then CUDA takes that into account
when it decides where to put it.
And then in the extreme end you can still
write CUDA 1.0, like you still can go back
and say, well for most of my data
I let CUDA decide what to do but for this
piece of data, just this
one, I'm gonna control it.
And so at the sort of high
end of performance tuning
you have full control.
- [Man In Gray Sweatshirt]
Okay thank you very much.
- [Man In Black Shirt] So I have two questions.
First one is how do you debug this,
and the second question is
about the memory order constraints.
- [Olivier] Yeah.
- [Man In Black Shirt] So next thing, you'd say
this is all sequentially consistent.
- [Olivier] No.
- [Man In Black Shirt]
Well most of the time.
- [Olivier] No.
- [Man In Black Shirt] Okay.
But on ARM, like, it matters,
was what I was saying.
- [Olivier] Yes.
(audience laughs)
- [Man In Black Shirt] On x86 it usually doesn't matter?
- [Olivier] Ehh.
Okay, alright, alright.
How do you debug this?
Your first question,
to your first question.
How do you debug this,
actually you debug this the way
you would normally debug this.
Which could mean different things.
If you're a printf
debugger, we have printf
so you can do that.
Of course printf from
hundreds of thousands
of threads could be very impressive.
Okay,
Okay.
But more seriously now,
CUDA comes with GDB,
it comes with CUDA GDB,
it comes with a GDB that
does everything your normal GDB does.
But in addition to what it normally does
you can back trace, and set context,
and print variables, and jump to frames,
and jump to arbitrary threads that are
currently running on the GPU side.
It looks and feels
exactly the same as usual
and that GDB also works with DDD.
So you could use DDD visually to do it.
Okay so on Linux that's
how you would do it.
How would you do it on Windows,
well how do you debug on Windows?
Obviously you use Visual Studio,
no but like obviously.
So we're integrated into
Visual Studio as well.
So you use Visual Studio
to debug on windows.
Okay so you debug as normal, basically.
Okay now for the memory model annotations.
So first off,
now I'm gonna really
channel my SG1 chair hat.
It is not salient what the machine does.
The C++ memory model is the
memory model of the C++ abstract machine.
And you shall use it correctly,
even if using it incorrectly does not
result in failure on the box you're on.
- [Man In Black Shirt] What I'm saying is,
go to Godbolt for example, and if I use
memory_order_relaxed, I get
the same exact code generated
as if I'm using sequentially consistent.
That's what I was saying, like--
- [Olivier] So not quite.
It does generate.
It has to generate
mfence in certain cases.
The compiler mappings are available on
a website from the
University of Cambridge.
It generates mostly the same code,
you're correct, it's not
exactly the same code.
- [Man In Black Shirt] But on ARM
it's completely different.
So what I'm saying is, on GPUs, does it matter?
Or is it mostly--
- [Olivier] This atomic
library does the right thing.
- [Man In Black Shirt] Okay,
but performance matters.
If I use the wrong one, or, I can basically
get better performance
if I use the correct acquire/release,
versus just slapping sequentially
consistent on everything.
- [Olivier] But I'd like to point out that the optimizer
on x86 will exploit the
freedom that you give it
when you use the correct memory order.
If you use memory order acquire,
there's an optimization
known in the memory model
community as the roach motel optimization.
(audience laughs)
Which is operations can check in,
but they can't check out.
So if you have a critical
section and it starts
with an acquire and it
ends with a release,
operations before the critical section
can actually be moved
into the critical section.
And operations below can be
moved up, but not the reverse.
Right?
So that optimization is
a valid and interesting
optimization even on x86.
Right,
and so using the proper memory order
always rewards you with
the best performance
on every machine there is.
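A sketch of the roach motel idea with an acquire/release spin lock; the names are illustrative:

```cpp
#include <atomic>

std::atomic<bool> busy{false};   // illustrative spin lock
int protected_value = 0;
int unrelated       = 0;

void update() {
    unrelated = 1;                                    // may legally "check in"
    while (busy.exchange(true, std::memory_order_acquire))
        ;                                             // acquire: section begins
    ++protected_value;                                // must stay inside
    busy.store(false, std::memory_order_release);     // release: section ends
}
```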
So yes, the step is a little bit bigger
when you go to ARM and POWER,
and definitely NVIDIA GPUs.
So yes, you have to use the right one.
Our memory model (blows raspberry)
it would take hours to really get into it.
But our memory model is
in the vein of IBM POWER.
- [Man In Black Shirt] Okay thank you.
- [Olivier] Hi.
- [Bald Man] So the triple angle brackets
they take two things, the blocks
and threads?
- [Olivier] Yes, they
take more things actually
they're sorta variadic like there's--
- [Bald Man] So do those have to be constexpr,
and what are they?
- [Olivier] They don't have to be constexpr,
no, they're just as good as operands
to the function itself.
So you could have just computed some value
and pass it in, yeah.
What do you mean what are they?
They're unsigned ints.
- [Bald Man] What are blocks versus threads?
- [Olivier] Oh this is
the high radix part.
So we don't present a model
of 163,840 flat threads.
That's not the model.
Threads are collected into blocks,
and blocks can have
between 1 and 1024 threads.
And then we have waves of blocks,
which are called grids,
but they're invocations of a function.
So you decide, do I want lots of small blocks
or fewer bigger blocks to express
the problem I'm working on.
The trade-off is that bigger blocks consume
more register file space.
And so you might want a smaller block
so that your threads can have more registers,
because we have a variable register architecture.
It's one dimension.
The other dimension is
that threads within a block
are guaranteed concurrent,
they're guaranteed
to be able to cooperate on a problem.
So if you wanted many
threads to work together
to compute a result, then
you want those threads
to be together in a block.
So as you think about how you decompose
your problem, you might say,
I want 128 threads per block
with however many blocks that means
because I want more registers with that.
Or I might say I want
1024 threads per block
because the cooperation
between these threads
is where the real value comes from.
But they'll get fewer registers.
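For example, the same 163,840 threads could be decomposed either way (hypothetical kernel, only the launch shapes matter here):

```cuda
__global__ void work(const char* input);   // hypothetical kernel

void launch(const char* input) {
    // The same 163,840 threads, grouped two different ways:
    work<<<1280, 128>>>(input);   // small blocks: more registers per thread
    work<<<160, 1024>>>(input);   // large blocks: maximum intra-block cooperation
}
```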
- [Bald Man] Thank you.
- [Man With Red Hair] Howdy.
- [Olivier] Hi.
- [Man With Red Hair] Now that we have
starvation-free algorithms
that are available to us on GPUs,
as well as lock-free algorithms, what are
the kinda big considerations
that we should be taking
with regards to boosting
or sacrificing performance by choosing
between the two paradigms?
- [Olivier] Right so,
it is not generally the case
that lock free algorithms
are faster than starvation
free algorithms.
It's an intuition that
a lot of people have
but it's an incorrect one.
Sometimes the lock free
expression of an algorithm
has many more steps with
memory barriers in it.
That you don't need to have there
if you use sort of a low
contention mutual exclusion scheme.
In this case here, the
reason we really got away
with having mutual exclusion
is because a radix 26 tree
reduces contention real fast.
You know log 26 is a fantastic
amortization strategy.
So we had mutexes here and we survived
because contention was low, okay.
There are many other, there
are other cases I can think of
that have this attribute,
where I could write
the lock free one but the lock free one
would have a lot more overhead.
Okay so how would you choose?
I think you would choose the same way;
this is one of the things
that I originally imagined
would not be true,
but turns out to be true.
You bring to it sort of the same thinking
you would on the CPU,
and it's all about contention.
- [Man With Red Hair] Thank you.
- [Olivier] Any more questions?
Thank you.
(audience applauds)
