- Okay so let's get started.
Welcome everybody, my name is Juan.
I have plenty of bad ideas, and today I'm going to present one of them: Easy::jit, a just-in-time compilation library for C++ code.
So this is actually a library
for runtime code generation.
It's not a virtual machine; it's not going to magically appear out of nowhere, instrument your code, and start optimizing it on its own. It's not a read-eval-print loop, and it's certainly not a set of building blocks for building something higher level, like LLVM is. This is simply a library for runtime code generation.
And it's got some constraints. It has to be a type-safe wrapper around LLVM, so you are not going to be managing raw function pointers and casting things in and out, and stuff like that. It has to be easy to understand, and it has to be clear and controllable. So it's a library that has one function and one abstraction in total.
Also, the front end, the language, has to remain unmodified. But I actually need some help from the compiler to do this, so I have to plug into LLVM; I will show you how later.
And finally, this is a hobby. This is not my day job, and I can't work on it during work time. So it has to be fun, and that's a priority. Some of the choices that I took for this library, regarding features, are really just because I thought they were fun to implement.
So why implement easy::jit? Have you ever used a just-in-time compilation library for C++? Can you raise your hand? Exactly that, exactly why. Typically there are not many just-in-time compilation libraries for C++, and normally you need some compiler knowledge to start diving into them.
So let's see an example to understand more
or less how this library works.
So here we have an image kernel; this is a classical kernel computation. The outer loops scan all the pixels of an image, the inner loops scan the neighborhood of a pixel and perform some computation. If by any chance we knew the values of these parameters, the compiler might be able to further optimize this code. But typically, at compile time, you don't know them. Furthermore, if we knew the values that the mask array contains, we might even be able to go further, but at compile time, again, those values are not there. They are only there at runtime.
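For reference, here is a minimal sketch of what such a kernel might look like; the signature and names are assumptions, not the exact code from the slides.

```cpp
// Hypothetical image kernel: convolve every pixel with a mask.
void kernel(const char* mask, unsigned mask_size, unsigned mask_area,
            const unsigned char* in, unsigned char* out,
            unsigned rows, unsigned cols) {
  for (unsigned r = 0; r < rows; ++r)            // outer loops: every pixel
    for (unsigned c = 0; c < cols; ++c) {
      int sum = 0;
      for (unsigned i = 0; i < mask_size; ++i)   // inner loops: the
        for (unsigned j = 0; j < mask_size; ++j) // pixel's neighborhood
          sum += mask[i * mask_size + j] *
                 in[((r + i) % rows) * cols + ((c + j) % cols)];
      out[r * cols + c] = static_cast<unsigned char>(sum / (int)mask_area);
    }
}
```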
Now let's consider this invocation of the kernel function through std::bind. I don't know if you know std::bind here. Are you familiar with std::bind from C++? I have seen some heads nodding yes. That's so cool, because normally nobody knows about it. Okay, so this function comes from the standard library. It returns a function object, and the call operator of this function object will call the kernel function, passing mask, mask_size and mask_area as the first to third parameters of the kernel function. Then the first and second parameters of the call operator are going to be forwarded as the fourth and fifth parameters of the kernel function, and so on for the rest of the parameters. And the result is a function object that we can call, and it's equivalent to the previous code.
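Roughly, and assuming the kernel signature sketched above, the binding looks like this:

```cpp
#include <functional>
#include <vector>
using namespace std::placeholders;

// Assumes kernel() from the sketch above is in scope.
std::vector<unsigned char> run(const std::vector<unsigned char>& in_img,
                               const char* mask, unsigned mask_size,
                               unsigned mask_area,
                               unsigned rows, unsigned cols) {
  // mask, mask_size, mask_area, rows and cols are bound now;
  // _1 and _2 (the input and output buffers) stay open.
  auto my_kernel = std::bind(kernel, mask, mask_size, mask_area,
                             _1, _2, rows, cols);
  std::vector<unsigned char> out_img(in_img.size());
  my_kernel(in_img.data(), out_img.data());  // calls kernel(...)
  return out_img;
}
```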
Okay, what does easy::jit look like? Like this: it has one header file, easy/jit.h, and it has one function, easy::jit. It tries to mimic most of the behavior of the std::bind function. And the returned object is a function object that wraps the generated code. When you call it, it's going to perform the appropriate casts on everything, so it's type-safe, and this is the point where we call the LLVM just-in-time compiler: we generate optimized code by using the values of mask, mask_size, mask_area and so on for the optimization.
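The call mirrors the std::bind version above; a sketch, reusing the assumed kernel signature:

```cpp
#include <easy/jit.h>
using namespace std::placeholders;

void process(const unsigned char* in, unsigned char* out,
             const char* mask, unsigned mask_size, unsigned mask_area,
             unsigned rows, unsigned cols) {
  // Specialize kernel() for the current mask and frame dimensions; the
  // returned function object wraps code compiled at runtime by LLVM.
  auto my_kernel = easy::jit(kernel, mask, mask_size, mask_area,
                             _1, _2, rows, cols);
  my_kernel(in, out);
}
```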
And why? Because sometimes it gets you performance. For example, in this case the jitted function will perform four to five times faster than the original version of the code. On a video stream that may be really cool, because on a video stream the frame dimensions really do not change that often. Typically never.
Okay, so how do we achieve this? Let's see an overview of the library. I said that this library needs some compiler help. So you grab your C++ code that calls easy::jit and you compile it; well, actually you have to compile it with clang, and you have to use a special clang plugin. It's not a plugin on the front end, it's a plugin on the middle end, on the optimizer. What this plugin does is identify all the calls to easy::jit, see which functions are passed as parameters, or may be passed as parameters, to the easy::jit function, and embed the bitcode of these functions in the final executable.
Why? Because we will need it at runtime. At program startup, before the main function gets called, we are going to register all the function pointers with their associated bitcode. Then, when the easy::jit function is called with one of those function pointers, we are able to recover the bitcode implementation, replace the parameters by the actual values that we are passing, call the LLVM just-in-time compiler, and generate optimized code.
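A minimal sketch of the registration idea, assuming a global map; the library's real data structures and names are certainly different:

```cpp
#include <unordered_map>

// Maps a function's address to its embedded LLVM bitcode.
static std::unordered_map<void*, const char*>& bitcodeRegistry() {
  static std::unordered_map<void*, const char*> registry;
  return registry;
}

struct RegisterBitcode {
  RegisterBitcode(void* fn, const char* bitcode) {
    bitcodeRegistry()[fn] = bitcode;  // runs before main()
  }
};

// The plugin would emit something along these lines for each exposed
// function (hypothetical):
//   static const char kernel_bc[] = { /* embedded bitcode bytes */ };
//   static RegisterBitcode reg_kernel((void*)&kernel, kernel_bc);
```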
We even have one optimization for call sites: when we have the bitcode of the function being called, sometimes we are able to devirtualize the call and inline the actual function that you are calling, which can give performance sometimes. The bitcode that results is compiled by LLVM, and then all of this is wrapped in a function object that is opaque for the user, so the user does not need to know that there is LLVM behind it, except when compiling. And that's it.
And this library has two big components.
It has the plugin which is not that big
and it's got the C++ library.
And I said the C++ library has one function. So let's see, more or less, what this function does. This function, as you can see, has a weird macro attribute attached to it. That's there to help the compiler plugin quickly and easily identify which function is the easy::jit function, because the names are going to be mangled and are not going to be easy to identify. But it's nothing special.
Then, from the parameters that the easy::jit function receives, we are going to build what we call the context. The context contains everything that we need to perform just-in-time compilation: which function we are specializing, the values of the parameters, and so on. Then we pass this context to LLVM to perform the just-in-time compilation. The second part is rather straightforward, so I'm going to focus on the first one: how do we build the context?
And if we dive into the function. Just as a disclaimer first: I'm not really a C++ programmer, so you will see things to correct, and afterwards you can tell me which ones. At the bottom of the context generation we will find functions like set_param. So, for example, if the function expects an integer as a parameter, like here, it's going to cast the input argument to a parameter of the integral type and register it in the context.
Similarly, if it's a placeholder, like _1 or _2, we are going to store the actual index that is being passed, because we need to know it later, when we generate the code, in order to forward the arguments.
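A hedged sketch of these two cases; the real signatures in easy::jit differ, this only illustrates the idea:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Context {
  struct Param { enum Kind { Int, Forward } kind; int64_t value; };
  std::vector<Param> params;
};

// Covers only the two cases from the talk: integral values and
// std::bind-style placeholders.
template <class Arg>
void set_param(Context& ctx, Arg arg) {
  constexpr int idx = std::is_placeholder<Arg>::value;
  if constexpr (idx != 0) {
    // _1, _2, ...: record the index so codegen can forward the argument.
    ctx.params.push_back({Context::Param::Forward, idx});
  } else {
    // Concrete integral argument: cast it and register it for specialization.
    ctx.params.push_back({Context::Param::Int, static_cast<int64_t>(arg)});
  }
}
```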
And let's see a case that is a bit more fun than that. easy::jit can take another jitted object as a parameter for specialization, another object produced by easy::jit, because maybe you want to compose the things that you created. So if the thing being passed is an easy::jit function wrapper, which is the object that wraps the generated code, we are going to check whether the wrapped function's type can be assigned to the expected parameter type. If it's not possible, we are going to report a failed compilation; if it is possible, we are going to register it in the context.
Structures can also be passed as parameters. And if it's a structure, ideally we would like to treat it as if it were a string of bytes. But we can expect that this does not work as hoped. Why? Many reasons. One of them is that there is typically padding between the fields of the structure, which causes some problems, and there's another complication.
So imagine we have this pair of T, and we have this function foo that takes two Ts, just as an example. If we pass two pairs of int to the foo function, then in our LLVM intermediate representation we will have a call taking two integers as parameters. Why two integers? Well, apparently in the x86_64 ABI, if the structure that you are passing fits in an integer register, you're going to pass it in registers, and you're going to pack everything into the smallest integer that you can. What happens if it's a pair of doubles? Each structure is going to be passed as two doubles to the function. Okay, seems fair. What if we are passing a pair of pairs of doubles? Then it's going to be passed through memory, with some pointers.
Okay, but what's the problem with all of this? Imagine that someone hands you the lowered arguments for two pairs of doubles. We have four double arguments: where does the first structure begin, and where does the second structure end? We cannot know, and we have to capture this somehow. And this is dependent on the ABI. We are getting into scary territory.
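To illustrate the lowering described above (x86-64 System V); the IR in the comments is approximate, not the output of any particular compiler:

```cpp
#include <utility>

template <class T>
void foo(std::pair<T, T> a, std::pair<T, T> b);

// foo(pair<int,int>, pair<int,int>)
//   each pair is packed into one integer register:
//     call void @foo(i64 %a, i64 %b)
//
// foo(pair<double,double>, pair<double,double>)
//   each pair is split into two doubles:
//     call void @foo(double %a0, double %a1, double %b0, double %b1)
//   four double arguments, with no marker for where the first structure
//   ends and the second begins.
//
// foo(pair<pair<double,double>, pair<double,double>>, ...)
//   too large for registers, passed through memory:
//     call void @foo(ptr byval %a, ptr byval %b)
```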
So the solution that I found for this is to introduce a serialization function that takes the structure by value and returns a char pointer with all the fields flattened. How can you implement this function? Not in C++, but that's why you have the compiler. So I can implement it on the compiler side: it can introspect the signature of the function and implement the serialization correctly for the ABI. And we can also register with the runtime that, okay, this parameter here is passed as two doubles to the called function, or as a pointer, et cetera. This last feature is currently in progress. It's almost finished, it's almost there; I have to give it the last push.
And there's some other stuff in the jit.h header. For example, we have some options to control how the code is compiled, which can be useful to say, okay, compile with -O3 or -O0 or something else, or to affect how the compilation pipeline works. We expect to extend this, as you will see later.
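For example, something along these lines; easy::options::opt_level appears in the project's documentation, but treat the exact spelling as an assumption:

```cpp
#include <easy/jit.h>
using namespace std::placeholders;

void process(const unsigned char* in, unsigned char* out,
             const char* mask, unsigned mask_size, unsigned mask_area,
             unsigned rows, unsigned cols) {
  // Ask the JIT pipeline to optimize aggressively, similar to -O3.
  auto my_kernel = easy::jit(kernel, mask, mask_size, mask_area,
                             _1, _2, rows, cols,
                             easy::options::opt_level(3, 0));
  my_kernel(in, out);
}
```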
Why not have a code cache? This context object that I talked about is comparable, so it's cacheable. So I implemented a code cache, implemented on top of a standard map: if no template parameters are specified, it's going to use the context as the key of the map. So, for example, if later we try to jit the same function with the same parameters and it's in the cache, it's going to skip the compilation and return the compiled object directly.
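A hedged sketch of using the cache; easy::Cache and the code_cache header appear in the project's documentation, but consider the exact interface an assumption:

```cpp
#include <easy/jit.h>
#include <easy/code_cache.h>
using namespace std::placeholders;

void process_frame(const unsigned char* in, unsigned char* out,
                   const char* mask, unsigned mask_size, unsigned mask_area,
                   unsigned rows, unsigned cols) {
  static easy::Cache<> cache;
  // Returns the already-compiled function if kernel was specialized with
  // these same values before; compiles and stores it otherwise.
  auto const& my_kernel = cache.jit(kernel, mask, mask_size, mask_area,
                                    _1, _2, rows, cols);
  my_kernel(in, out);
}
```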
And for threading? Well, this is only C++; there's nothing special, no extensions to the language, nothing really weird. You can move the objects in and out and all around. So you can use your regular C++ constructs and everything is supposed to work, and normally it does.
You can also do something more fun: you can serialize your compiled function into a string, send it over the network to a server, and the server loads it, compiles it, and starts executing it. Yeah, totally not a security danger.
(audience chuckles)
But you can do it.
As I said before, you can compose the generated objects with one another. So here, for example, the function foo takes two integers a and b; we generate a new function foo_a that only takes one integer, and we are now able to pass foo_a to the map_vec function and generate something specialized.
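A hedged sketch of that composition; foo, foo_a and map_vec follow the talk, while their bodies and the exact parameter type that map_vec expects are assumptions:

```cpp
#include <easy/jit.h>
#include <cstddef>
#include <vector>
using namespace std::placeholders;

int foo(int a, int b) { return a + b; }

// Applies a unary function to every element of a vector.
void map_vec(const std::vector<int>& in, std::vector<int>& out,
             int (*f)(int)) {
  out.resize(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) out[i] = f(in[i]);
}

void example(const std::vector<int>& v, std::vector<int>& out, int a) {
  auto foo_a = easy::jit(foo, a, _1);              // specialize foo on a
  auto mapper = easy::jit(map_vec, _1, _2, foo_a); // compose: feed foo_a in
  mapper(v, out);
}
```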
But now let's forget about the C++ library and talk a little bit about the compiler plugin. The plugin is actually really, really simple, and one of the ideas is that it has to be as simple as possible, because it has to be fun and I don't want to maintain it. To use the plugin you only have to specify one flag, -Xclang -load -Xclang plus the name of the plugin library, and you are done; nothing special, nothing weird. You can use your regular clang and it's supposed to work.
The main goal of the compiler plugin is to identify which functions are used by easy::jit. To do this, we look at the values that are passed to the easy::jit function, we walk up through the values that are assigned to them, and we try to discover which functions they are. And naturally, there are some problems sometimes.
For example, if the function is declared in another compilation unit, the compiler is not able to see its body, since it's coming from another compilation unit, so it's not able to handle it. Then it's up to the programmer to say: okay, this function here, export its bitcode. Or you can use a regular expression; ".*" is a regular expression that exports everything, and it works.
There is also one option in clang, called -fembed-bitcode, that embeds the bitcode of the entire application in the executable. I know how it works, I know what it gives, but I really haven't tried using it, because in my vision of a just-in-time compiler, or of the use of a just-in-time compiler, there are only a few kernel functions that you would like to recompile at runtime, not your entire application. But who knows, maybe that other approach is the better one, I don't know.
One more thing is where this library is going. Because it's really early and we are young, we want to have a lot of fun. One of the obvious ideas to implement with a just-in-time compiler is profile-guided optimization; that's the advantage that you have at runtime. So we want to have an extra option to specify: okay, this is my profile data structure, create a profiled version of the code. This profiled version of the code will fill the profile data, and then we can reuse what has been profiled to reoptimize the code, based on this profile data, and perform speculation.
With something like this you can say: okay, I can see that we are always calling the same function indirectly; well, here you have the bitcode for that function, inline it, and you can get some extra performance like that.
Then the next objective would be to create a version that does this automatically. I find it scary, but maybe someone is really happy to let the just-in-time compiler intervene whenever it wants. Like this, the function object would decide by itself: okay, instrument now; now execute the optimized version; okay, recompile, or recompile in a background thread, and let's see what happens, and so on.
And the final objective, the grand objective that I, and many other people, may have (there is a lot of prior work on this topic) is to perform partial evaluation. That is to say: for example, we have an evaluate function that evaluates an AST given a valuation of its variables. And you say: okay, you know what, the AST is never going to change from now on, it's immutable; generate a version of the code that is equivalent to compiling this AST. And this is actually really difficult, from what I'm investigating, but I'm moving forward on it, and it's not easy.
And we also need to introduce this notion of immutability, which means that the function object we create takes ownership of the AST. Because we inline everything, every value that the evaluation touches, including pointers into the memory backing the AST; if that memory is freed, everything is going to crash and people are going to die, typically.
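A hedged sketch of what that could look like; the Expr type, evaluate() and the call shape are hypothetical:

```cpp
#include <easy/jit.h>
#include <map>
#include <string>
using namespace std::placeholders;

// A toy AST: constants and additions only.
struct Expr {
  enum Kind { Const, Add } kind;
  double value;           // used when kind == Const
  const Expr *lhs, *rhs;  // used when kind == Add
};

double evaluate(const Expr* e, const std::map<std::string, double>& vars) {
  (void)vars;  // a real evaluator would also look up variables here
  return e->kind == Expr::Const
             ? e->value
             : evaluate(e->lhs, vars) + evaluate(e->rhs, vars);
}

void example(const Expr* ast) {
  // Promise that *ast is immutable from now on; the hoped-for result is
  // code equivalent to having compiled the AST directly.
  auto compiled = easy::jit(evaluate, ast, _1);
  std::map<std::string, double> vars;
  double r = compiled(vars);
  (void)r;
}
```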
And there's much more. For example, I didn't implement support for methods, not because I didn't want to; it's because I'm not good at C++ and I did not find it fun, but it should be straightforward. For function objects it should be similar to what we would do for methods.
It would also be really nice to continue with the cache data structures: to implement a thread-safe cache, or a cache that supports specialization or persistence. I haven't implemented them because I didn't think it was fun.
And, yay, there's a spinoff of easy::jit: there is one guy who forked it and is using it to prototype an auto-tuner for C++, and he's going to present it in two weeks, during the LLVM developers' meeting. So if you are there and want to see his presentation, go.
And yeah, wanna join the fun? Because easy::jit is a fun project. I do it as a hobby, and if you want to contribute it would be really cool, because I'm looking to learn. So if you would really like to get into LLVM, easy::jit may be the project for you, because the code is really simple. There are many low-hanging fruits. For example, the profiling, and the optimization based on that profiling, shouldn't be that difficult to implement, and it's something that's fun, at least for me. And maybe you are a really good C++ programmer and you even like to tell someone how they got something wrong; I got a lot of things wrong in my code, so yeah, easy::jit may be the project for you. And everything is on GitHub if you want to check it out.
Yeah, does anyone have some questions?
- [Man in Audience] How big is the footprint to make this portable?
- Make it portable?
- [Man in Audience] If you want to take it to another system, where someone used clang, or an old clang, that type of stuff.
- It shouldn't really be a problem; I'm not using anything architecture-specific or operating-system-specific. Maybe there could be some complications. Ah, I should repeat the question. So, the question was: what's the footprint to port easy::jit to another platform?
- [Man in Audience] No, how much bigger is your application, because you've included clang and all that stuff?
- Okay, the question is: how much bigger will the application be, because you include LLVM, the just-in-time compiler and everything? Yeah, really big, because you have to ship the entire LLVM with it. I have some ideas of how to reduce it, but it's not really easy to do. And yeah, for the moment I'm shipping a big part of LLVM with it, which is not good, but...
Yeah?
- [Audience Member] For what it's worth, you can probably trim down the target information to be very specific, to only the targets you support, and you can also probably reduce the number of optimization passes that you use in your executable, and of course you don't have to ship clang as part of this...
- I'm not shipping...
- [Audience Member] But you did ask, though.
- So it's not clang that is shipped, it's LLVM, and yeah, those are really great ideas that I have to try, like reducing the number of targets that are shipped with the application, because we are probably going to target the same machine that the program is running on. I mean, I'm probably going to compile your code for the same target you are running on now. So yeah, reducing the number of optimizations would greatly reduce the size. But for now it should be around 40 megs, I guess, or 30, with the library; it's not something acceptable.
- [Audience Member] You can get it pretty small, and embed it in the drivers, so...
- Okay.
Yes?
- [Male Audience Member] So obviously the most prominent application of this kind of thing is loop optimization, right? And one piece of hardware that runs lots of loops is GPUs, and the clang front end has support for CUDA. Have you looked at what would be involved in getting CUDA code to work with this kind of thing?
- I actually haven't checked, but that's a really interesting use case.
Any questions?
Okay, thanks.
(audience applauds)
