>> I'm very pleased to
welcome Adam Chlipala
from MIT today and his
two students as well,
Ben and Clément, who are here visiting.
Adam is going to tell us about
Fiat Crypto which is a way to
produce efficient low
level cryptographic code,
sort of correct by construction from high-level specs.
Looking forward to
hearing about it.
>> Okay, thanks. And let me say,
I'm very glad to have
questions throughout the talk,
not just at the end.
So, let's deal with
this Fiat Cryptography stuff.
Well, I'll give
the initial setting here,
which I'm sure many
of you know working
on very similar things.
But let me just give
the cartoon picture
of what it is to use
the TLS protocol to initiate
an HTTPS web connection.
So, here's the web browser,
there's the web server.
The web server has a public key and a certificate that have been signed by some authority to tell you that whoever knows the matching private key really is the company they claim to be.
So, the web browser
wants to initiate
a connection with
the web server.
So, a key exchange protocol
is used to establish
some secret that
only these two parties know.
That is a symmetric key.
And then, we need to check that,
the browser needs to check it's
talking to the server
it meant to talk to.
So, the certificate and
public key are retrieved
and then we use
a digital signature algorithm for the server to prove that the party you were just talking to, to learn that symmetric key, really is the company it claims to be.
And now, from that point on,
we can just use
symmetric key crypto,
which is generally a lot
cheaper than public key crypto.
So, every single HTTPS
connection that goes
on needs to go through
this little dance to
establish the symmetric key.
The first two stages
of that process that
use public key cryptography,
they only run once per
session but there can be
many HTTPS connections coming
into a serious datacenter.
So, it's important to optimize
the performance of
the public key part as well.
And the challenges
there have to do
with trading off between
correctness and performance.
And, in fact,
the public key algorithms
in practice are a lot harder
to implement correctly and prove
correct than the
private key algorithms.
And one of the biggest
challenges here
is doing big integer arithmetic,
modulo some huge prime number,
and that is what
we're going to try
to get right with
programming tool support.
So, how bad can it be to
do arithmetic, after all?
Well, there are not
that many different crypto
libraries used widely,
luckily, and they're built
by pretty competent people.
But they really have their work
cut out for them because there's
a whole space of algorithms that
matter in practice
for cryptography.
There's another space of
different prime moduli that
matter for implementing
these algorithms.
And then, as usual, there are a bunch of different hardware architectures.
It is surprisingly close to true in practice when I say that every element of this three-way Cartesian product gets implemented separately with no code sharing.
And only a relatively small set
of experts on the planet
know how to do it in a way
that achieves
acceptable performance.
So, when you change the prime
number, it really is true,
the experts rewrite
the assembly code from scratch
for every hardware architecture
to get the best performance.
This is the state of practice.
And they get it wrong
a decent fraction of the time.
We found quite a few
bug tracker issues that
were related to
getting this kind of
arithmetic wrong for
different kinds of protocols.
So, there's a real
opportunity here to
improve the development tool story for this space.
Here's an example of someone
getting something wrong.
This was a comment in
reference C code written
by the same people
who designed the primitive
that it's implementing.
They left a comment saying,
I'm not really sure
if this variable
can be negative or not.
You should probably
go read this book,
and if you figure out
that it could be negative,
let us know, fix it here.
So, that's not exactly
confidence enhancing.
And it comes from
this very cleverly named
online benchmarking
comparison site
called SUPERCOP that has this,
of course, completely rational
expansion down
here at the bottom.
This is what everyone uses for
benchmarking cryptography code.
Another bug, by
the same people again,
with the reference
implementation,
essentially of their own
algorithms they invented,
was that they wrote about 10,000 lines of assembly code manually to get
the expected level
of performance.
And oops, they wrote R1 when
they meant R2 in one place,
and this meant that
the code was incorrect.
There was also a bug in OpenSSL that involved
a disagreement about
the contract between
the caller and callee of
a divide with remainder
kind of operation.
And oops, the people who found and analyzed this bug computed that, using at least dumb random testing, the chances of finding the bug were something like 10 to the negative 29.
There are a lot of corner cases
to worry about in
this kind of code,
not always obvious
what they are.
So, testing is not
super effective at
catching problems that can
have serious security
consequences.
So, let's try to do better.
Our project is
Fiat Cryptography.
Here's a cool logo made by a professional designer in the context
of our larger project,
Science of Deep Specification,
which is a multi-university
initiative.
I don't know how many people
recognize the spy
versus spy iconography here,
it's a classic comic strip and
the spy has become a rooster because we use the Coq proof assistant, so it's a natural match.
So, what is this project?
Well, let me tell
the general story of how
Correct-by-Construction
Cryptography might work
and then tell you which parts
this project deals with.
So, let's say we have
some abstract security property
like if you're able to
produce a signature fast
enough then you must have
known the secret key.
We might do protocol
verification to show that
a particular mathematical
algorithm written at
a relatively high level
truly provides
that security property.
And we might do
implementation verification
or synthesis to show that
the actual performant
low level code
is a correct realization of
that mathematical algorithm.
This project is only
the second of these two steps,
though we'd be
very interested in
connecting to results of
the first kind and we're doing
some preliminary work
along those lines now.
So, let me zoom in
on the second part.
So, we have a
mathematical algorithm.
We're working in particular
on something called
elliptic curve cryptography,
which is what's
typically used for
those first stages of
TLS that I talked about.
So, we're in
algebraic structures
with two dimensional points
to start out with.
And it turns out,
that in practice,
two dimensional points are
implemented with points in
more dimensions, because it is faster and helps you avoid the processor instructions that are the slow ones. Introducing extra dimensions is largely orthogonal to what I'll mostly be talking about, so I won't say more about it here.
But then we have basically
two functional programs,
one that uses
the simple representation,
the other that uses
the fancy representation.
And we prove that
the second one is
a correct implementation of the abstract data type that the first one defines.
Then, we need to take
the individual coordinates
of these points,
which are big integers
in a modular prime field,
and we need to show that we're correctly implementing integers that are so big they don't fit in single hardware registers.
We need to divide
one number across
multiple registers and
that's really the heart
of what we're doing here.
So, we prove an
abstraction relation about
the implementation of single
integers within the points
as multiple roughly
machine word sized pieces.
And then we take
particular choices of how
to split a number into
multiple parts and then
we specialize our library of
code to the particular
choices you've made to
get idiomatic C-like code
that is customized
to those choices.
And we also do
abstract interpretation
to infer bit widths
of temporaries and assign them
to the appropriately sized
registers and local variables.
And our compiler for
that last part is proved
correct once and for all,
whereas,
some of the earlier
stages are proof
generating instead of verified.
So, several people in
this room have been doing
very closely related things.
So, let me give
the comparison with
the most closely related
projects that I know about.
So, there's the HACL* project, which is an F*-based crypto library using Low*; there's Vale, which Chris has been working on, doing metaprogramming to build this kind of code and reason about it, with generators using Dafny.
And there's a project called
Jasmin which is, basically,
a verified compiler for
a low level language that is
appropriate for cryptography.
So, quick summary of
the advantages of those projects
over what I'm
talking about here.
I showed you a few steps and said we only do some of them.
Well, these other projects are
doing more of the other steps,
at least as I understand things,
and their results are pushed all
the way down to
assembly code in some cases.
And we are currently stopping
at a higher level,
roughly C-code level.
And also, the project I'll be
presenting finishes
up generating
just straight line code,
no control flow and
these others all
have stories for control flow.
Some of the pros of
our approach are that
we have a much smaller
trusted code base.
We're doing the usual Coq thing
where we don't have to
trust SMT solvers or
compilers or any of that.
And, more fundamentally,
we are showing
complete automation
of regenerating a protocol,
or a primitive actually,
when you change what the prime modulus for your arithmetic is.
It sounds innocuous,
but, in practice,
this is what implementers
spend most of their time doing.
When you change
the prime modulus,
rewrite all the code,
our compiler does
that automatically.
And we also regain the advantage we've been used to since the '80s: compilers that retarget new hardware architectures automatically.
It turns out, and I'll show an example soon, this code is completely rewritten for each new architecture in the state of practice.
So, our compiler generalizes over both those dimensions and rebuilds stuff for you automatically when you change either one.
Given that there are
few people in the room
whose work I'm
comparing against,
any questions at this point?
Okay. All right.
So,
>> Jasmin guys would really
disagree with you [inaudible]
>> Good thing we kept them away.
All right. So, in other words,
one dimension of variation
in code in this space is
the different prime moduli
that we're doing
our arithmetic with.
Another is all the hardware
architectures out there.
There's a whole space of
reasonable combinations of
these and we are
going to let you pick
any of those points,
automatically generate code with a Coq proof of correctness with respect to pretty simple whiteboard-level math about elliptic curves.
And we ran some experiments
I'll get to later,
but in terms of how we really
demonstrated that we have
good coverage of this space,
we've done experiments on
about 80 different primes
on this axis and
on this one we did
the usual thing of 64 bit x 86
and 32 bit ARM running
on an Android phone.
And we get pretty good code for almost all the elements of this cross-product space.
What does the code look like?
Well, it looks like this.
It's straight-line code that can be
C pretty easily and
it's probably not super easy
to squint at this and see
interesting differences
between the 64 bit and
32 bit versions for
a particular prime number.
But this is a lot more code on that side, as evidenced by the smaller font size, because you need to use more registers when they don't store as many bits.
And we can generate 128-bit code, getting ready for more processors that support those registers directly, with almost no new effort; probably likewise for 16-bit code and things like that.
Our framework is generic
in those decisions. Yeah.
>> I understand [inaudible]
the same operations will
be carrying up to almost
like forty eight figures.
>> Right.
>> It seems obvious to me why the variables in that 32-bit [inaudible]
>> The question is, I'm talking about registers, but I'm just showing you C code with as many local variables as you like here.
So, we depend on
a register allocator like usual.
Most of these variables quickly
become dead as you proceed
down the code and so,
you mostly do manage to use
registers for all this stuff.
In fact, the high-level algorithms are designed so that that's the case and you get the best performance.
Does that answer your question?
>> I don't see why
the 32 bit side
will be different then?
>> We have... The question is, why is the 32-bit side different? I'm not sure which dimension you're talking about, but notice over there we have 32-bit and 64-bit sized variables. Maybe I'm the only one who can see it, perhaps since I'm so close. This side has 64-bit and 128-bit, so that's the difference. Okay.
>> How do you expect the 128-bit variables to be compiled?
>> How do they compile? So, for instance, here I have a 128-bit addition of two 64-bit values, and we use compiler intrinsics to get the native processor instructions for that.
>> Perfect.
>>Which is really important
for some of these outcomes.
>> So do you
expect like this to be
split among two registers?
>> Yeah. It's an addition that goes into two registers. We also have add-with-carry instructions for some algorithms and not others. So we do depend on a few different flavors of C compiler intrinsics that wouldn't normally be around. All right.
So, another dimension of success here is that we're now the Curve25519 implementation in Chrome, and we're working on being likewise for P256. Those numeric buzzwords are the two most popular elliptic curves in the TLS standards. P256 is still the most widely used.
Curve25519 is like the hipster curve: if you're ahead of the popularity curve, you realize it's the one everyone is going to be using. And it's a lot simpler in format, which has both pros and cons, in that it's easier for us to do first, but it's more impressive if you get P256 also.
And we have that proved; it's just still being integrated into the BoringSSL library that Chrome uses.
And it'll first be
out there in Chrome Version
64 which I think
is in development stage
now and will switch to
Beta in the next few weeks.
So, let me go back and
talk a bit more about
what really is going
on in this arithmetic
and what are the challenges.
So, like I said,
different prime moduli lead to
dramatically different
implementation strategies
to get the best performance
on the processors that we have.
One prime number that's very important is 2 to the 255 minus 19; it was designed by crypto implementation experts to have good code on commodity processors.
It fits a general pattern that's been given many names; one of them is pseudo-Mersenne primes, where you have a power of two minus a small constant.
Why do we care about this?
Why is this going to
make the code fast?
Well, a key operation that we
need to come up with
is modular reduction, which is basically the mod operator that we're all used to from math classes.
And the thing is, we know a mod operator is built into the hardware, and if you happen to work with a power of two, the mod is very efficient to compute; but in crypto, part of the point is to make certain operations not efficient to compute.
So we have a number that's almost a power of two, but unfortunately, almost doesn't count here for getting good performance.
So we're going to need to
work to get mod to be fast.
So, here's an example of how
we might think about this.
Let's say I just did
multiplication or
something where the output is
naturally twice as wide
as the input operands.
And so, I can take
the output and think of it
as a higher order word
and a low order word.
We shift the higher
order word into
place by multiplying it
by the right power of two.
Now I want to take this thing and perform the modulo operation to get it back down to a single width.
So let's do a little algebra
and I'll subtract and add C,
that small constant in
the appropriate place
and then distribute.
And now something nice has happened: we have this term here that's a multiple of the modulus, so goodbye to that thing.
And now there's this kind of
clever thing that's
happened where we've done
a modular reduction where
all we need to do is
multiply by a small
constant and add.
And now this sum is probably, though it's not true in all cases, smaller than the modulus.
And we've done
the mod without using
any hardware mod instructions
or anything else slow.
We just need addition
and multiplication.
Sometimes we even manage to do it without multiplication, since we know in C we might implement that as several additions if it's faster.
So, it's really crucial
that we can do this
in practice to get
good performance.
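To make the trick concrete, here is the same algebra for a generic pseudo-Mersenne prime $p = 2^k - c$, with a double-width value split as $x = 2^k h + \ell$:

$$x = 2^k h + \ell = (2^k - c)\,h + c\,h + \ell \equiv c\,h + \ell \pmod{2^k - c}.$$

The $(2^k - c)\,h$ term is a multiple of the modulus and vanishes, leaving only a multiplication by the small constant $c$ and an addition.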
So one thing we will
do in our library,
is generalize
this trick and make
our compiler automatically
use it where appropriate.
So that was at
the level of the whole result
of multiplication
assuming we have really big
registers which we don't.
So, let's think about
this prime number
and say we're going to
again do a multiplication,
the output of the multiplication would actually be 510 bits wide.
And so we could represent it on
a 64 bit machine with
eight different registers.
And now let's try to
use that trick once
we've peeked under
the covers and
seen how we break
numbers into registers.
So I've written the formula
for turning one
of these lists of
registers back
into a big number.
And I've written it
suggestively splitting
the registers into
two halves and
multiplying by two to the 256 to
capture shifting
the higher order half
into the appropriate
position of the answer.
Unfortunately, that says 2 to the 256, which is not 2 to the 255, so the modular reduction trick that I just showed doesn't apply. That means, somewhat counterintuitively, this representation of big numbers is much, much slower than the common ones on commodity processors, so this is not a good choice.
The representation that's actually used in practice is to use 10 different registers whose digits only take up 51 bits each, even though we're storing them in 64-bit registers.
So we're kind of
wasting bits but now
we can use this modular
reduction trick,
By the way, at various points we'll allow the registers to grow larger than 51 bits, and we'll carefully bound how much they grow with each operation; eventually we'll carry and reduce down to get back to the original representation.
So, the most widely used
64-bit representations
use this trick.
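For scale, with $p = 2^{255} - 19$ and 51-bit digits, a value $x$ evaluates as

$$x = \sum_i x_i\,2^{51 i},$$

and because $255 = 5 \times 51$ lands exactly on a digit boundary, any product term with weight $2^{255+e}$ folds back in cheaply:

$$x_i\,2^{255+e} \equiv 19\,x_i\,2^{e} \pmod{2^{255} - 19}.$$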
And then for 32-bit, the way it's usually done is, we'll take 20 registers, each of size 25.5 bits, and get a representation like this one. Of course, it's a little expensive to get fractional-bit registers on commodity processors.
So what that actually means is
let's take a floor
or ceiling operation
in the appropriate places
to get these bit widths
to be whole numbers.
So now, not only do we have some choice in how many registers to split the number into, and not only do we choose how many of the bits in the registers we'll actually use, we might even make non-uniform choices across the different registers that we're using.
So this is the standard
32 bit implementation.
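In other words, rounding the fractional exponents up, the weight of position $i$ in this 25.5-bit scheme is

$$w_i = 2^{\lceil 25.5\,i \rceil},$$

so the digit sizes alternate between 26 and 25 bits.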
So at a high level, what we're doing is building a compiler, or a framework (a few different words apply), that is parameterized on choices like this, and for any given choice guarantees to compile correct code, with a Coq theorem that the arithmetic operations are actually computing the original whiteboard-level number theory stuff.
So, our library takes in your choice of representation in that space of possibilities, and spits out fast quasi-assembly code; at the moment we actually assume infinitely many temporaries, but everything else is very assembly-code-looking. And you get a proof out of the library; parts of it are proved once and for all, parts of it are proof-generating.
So, in a little more detail
what's going on
inside that green oval,
we have a library of Haskell-like functional programs in Coq that implement all the key arithmetic.
And then we use partial
evaluation after you choose
a particular digit
representation
to specialize it to
your particular choice.
Now we get flatter functional
programs that look even
more like assembly code
and then we do
bounds inference
and other compiler
optimizations to generate
the actual low level code
that comes out in the end.
So here's an example of one of
the kinds of algorithms
we need to deal with.
Let's look at multiplication modulo 2 to the 127 minus 1, a simpler version of the prime I was showing before.
So now, we might split each of
the numbers into
three different digits,
and they translate back
into big numbers like so.
So, one way to do
this multiplication
is just compute
all the cross terms
between S and T,
we get a bunch of
terms like this,
and I have lined them up
suggestively here because
down the columns we see
the powers of two are
very close to each other.
So, we can just add
down the columns
and produce an output like so,
and you can see in
some places the powers of
two were a little
off from each other.
So, we do extra multiplication
to compensate there.
And now, we're almost
done except we have
a double wide result here,
meaning we have five digits,
where we started out with
only three in the numbers.
We want to reduce
this down to being
just three digits using
a process called carrying.
And the way that works is we notice that, in the formula to convert the last two digits back into one number, those digits are shifted by multiplying by 2 to the 127; and because 2 to the 127 is congruent to 1 in the prime field we're working in, we can move those digits to appropriate other positions. For instance, U3 moves from all the way over there to here, and U4 moves to the second digit.
Now, we are back in the
appropriate three digit format.
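As a concrete check (assuming the three-digit weights are $1$, $2^{43}$, $2^{85}$, so the double-wide result has weights up through $2^{170}$), the two extra digits fold back like this:

$$u_3\,2^{128} + u_4\,2^{170} = 2^{127}\left(2\,u_3 + 2^{43}\,u_4\right) \equiv 2\,u_3 + 2^{43}\,u_4 \pmod{2^{127} - 1}.$$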
So, we write a functional
program that does this
generically for any choice of
base-system representation.
We proved that functional
program correct,
once and for all.
So, we can just specialize it
to new representation choices,
and automatically know that
you have the correct code.
But it's not as easy as just saying that: there are some representation choices that have a big effect on how easy the proofs turn out to be. So, you might think we would want to formalize base systems, or how we divide numbers into digits like this.
Let's have a weight function W
that for each position in
the number representation,
tells us what we multiply
that position by to reconstruct
the original number.
And we'll say the first position, zero, needs to have weight one.
That's the least
significant digit.
And all the other weights
need to be non-zero.
So then when we have
a list like this,
we convert it back
into a number by
multiplying each position
by the associated weight,
just adding all those together.
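Written out, that evaluation function is

$$\mathrm{eval}(d) = \sum_i w(i)\,d_i, \qquad w(0) = 1, \quad w(i) \neq 0.$$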
So, this turns out to be rather hard to reason about when doing the proofs.
Whenever you are doing
a multiplication like
on this previous slide,
you get all these cross terms,
and you have to ask yourself,
is the weight of this digit
already in my base system,
or do I need to do
some sort of conversion to
get it to match what
my base system supports?
So, for instance, this 2 to
the 85 versus 2 to the 86.
One of those is not going to be in the original base system, and we have to do some special work to convert between weights of digits.
Other little things like that were just building up in complexity as we were writing the code and doing the proofs.
So, instead, we decided on a different representation, called associational, where we represent numbers as lists of pairs of digits and the static weights associated with them.
So, for instance,
if we represent
the number 234 with
the usual decimal system,
then it gets translated to
that list over there that
just has the digits.
I think I meant to put,
something is a
little wrong here.
This looks like a two
in the last position
instead of a four.
Anyway, we take the digits,
and we pair them
with their weights
which are powers of 10
and get that list.
>> [inaudible] If it was 204, would we need to have (10, 0) or not?
>> The question is if one
of the digits is zero,
do we include an explicit entry
for that in the list.
And I think we do
because usually,
the actual digit values are
variables rather than constant.
So, we don't know
what's zero in advance.
All right.
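As a rough sketch of how this looks in Coq (with illustrative names, not Fiat Cryptography's actual identifiers), the representation and its evaluation function are just:

```coq
(* A minimal sketch of the associational representation; illustrative
   names, not the library's actual identifiers. *)
Require Import List ZArith.
Import ListNotations.
Open Scope Z_scope.

(* A number is a list of (weight, digit) pairs. *)
Definition assoc : Type := list (Z * Z).

(* Convert back into one big number: multiply each digit by its weight
   and sum.  Used only in specifications, never in the running code. *)
Fixpoint eval (p : assoc) : Z :=
  match p with
  | nil => 0
  | (w, d) :: p' => w * d + eval p'
  end.

(* 234 in the usual decimal system. *)
Example eval_234 : eval [(100, 2); (10, 3); (1, 4)] = 234.
Proof. reflexivity. Qed.
```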
But we were going to
generalize this and allow
a lot of flexibility in
how we split numbers up
and we're going to
allow lists like
this which may even
have duplicate weights.
So, for instance, we
could take this list and
add something else
multiplied by 100.
And we're not going to fix the weights in advance, because at intermediate stages during algorithms, all sorts of strange temporary weights might pop up, and we want to be prepared for any of those.
So, given a list like this,
it's pretty easy to write
a Coq functional program
that translates the list
back into one big number,
which we use to state
the correctness properties.
The actual running code
never does this.
We don't have registers big
enough to store these numbers,
but it's a very simple map
followed by a fold-right,
kind of thing which I'm
sure any of you could
have written in your sleep.
But the consequences
of this are really
great for simplicity
of the proofs.
So, for instance, we want
to implement addition.
Does anyone have
an implementation of
addition in mind for
this representation?
It becomes really trivial now.
Yeah, it's just
list concatenation.
And so, we can just prove
that the evaluation of
a concatenated list
is the sum of
the two values of
the original lists.
And this is a big simplification over what we need to do with the static base system approach. And it has a pretty short Coq proof, too.
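Continuing the earlier Coq sketch, addition really is just concatenation, and the proof is a short induction:

```coq
(* Continuing the sketch above: addition is list concatenation. *)
Definition add (p q : assoc) : assoc := p ++ q.

Lemma eval_app : forall p q, eval (p ++ q) = eval p + eval q.
Proof.
  intros p q; induction p as [| [w d] p IH]; simpl;
    [ring | rewrite IH; ring].
Qed.

(* add is definitionally ++, so its correctness is immediate. *)
Lemma eval_add : forall p q, eval (add p q) = eval p + eval q.
Proof. intros; apply eval_app. Qed.
```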
So, to do multiplication,
we just want to compute
all the cross terms
between the two lists,
and that's pretty
straightforward.
And so, we can define
multiplication like this,
take a map and
a flat map and put them
together in the right way,
and you can prove a theorem
like this one that
multiplication of this kind
really corresponds to
multiplication of the usual kind
with pushing evaluation
through the operation.
And that has a pretty
short proof as well.
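In the same Coq sketch, multiplication is that map/flat-map combination, with a correctness proof that is only a few lines:

```coq
(* Continuing the sketch: multiplication takes all cross terms,
   multiplying weights by weights and digits by digits. *)
Definition mul (p q : assoc) : assoc :=
  flat_map (fun '(w, d) =>
    map (fun '(w', d') => (w * w', d * d')) q) p.

(* Helper: one input term, crossed with the whole second list. *)
Lemma eval_map_term : forall w d q,
  eval (map (fun '(w', d') => (w * w', d * d')) q) = w * d * eval q.
Proof.
  intros w d q; induction q as [| [w' d'] q IH]; simpl;
    [ring | rewrite IH; ring].
Qed.

Lemma eval_mul : forall p q, eval (mul p q) = eval p * eval q.
Proof.
  intros p q; unfold mul; induction p as [| [w d] p IH]; simpl.
  - ring.
  - rewrite eval_app, IH, (eval_map_term w d q); ring.
Qed.
```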
And really, all the other
arithmetic operations
we need go similarly
smoothly to this.
But modular reduction is
a little more complicated,
not too much more.
So, remember I talked
about this trick
where we reduced
the modular reduction operation
to just multiplying and adding.
And we can generalize that to
the full space of base systems,
where we need
an operation called split
that takes one of these lists,
and breaks it into the terms
whose weights are above
a threshold and
below a threshold.
And that turns out to be easy to write, too: we just use a classic list partition operation that looks at the weights and sees which ones are divisible by the amount we're splitting by. It keeps one part of that partition the same, and reduces the other part, where everything is divisible by this weight; it actually does the division, and then we're good there. And then the reduce operation that we need performs the split and then does the multiplying and adding, because remember, we implement add with just concatenation of lists.
So, this too is quite
straightforward.
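Here is a hedged Coq sketch of split and reduce in the same style, assuming a modulus of the form $m = s - c$ and eliding the side conditions the real library tracks:

```coq
(* Continuing the sketch: split partitions terms by whether the weight
   is divisible by s, dividing s out where it is; reduce folds that
   part back in times c, relying on s being congruent to c mod m. *)
Definition split (s : Z) (p : assoc) : assoc * assoc :=
  let (hi, lo) := partition (fun '(w, _) => w mod s =? 0) p in
  (lo, map (fun '(w, d) => (w / s, d)) hi).

Definition reduce (s c : Z) (p : assoc) : assoc :=
  let (lo, hi) := split s p in
  lo ++ mul [(1, c)] hi.
```

The invariant is that eval p = eval lo + s * eval hi after the split, so replacing the factor s by c preserves the value modulo s - c.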
But there has to
be a catch, right?
Because we're working with
this crazy representation,
and in the end we
want the actual
running assembly level code
to be using a fixed
digit representation.
So, the way that works is we write code like this one, which takes inputs that are actually using dependent types to enforce that, say, this tuple Z n type is an n-tuple of elements of Z.
And we first, take
the regular representation,
convert each of the inputs
into the associational form,
then do the computation
in a nice representation,
and then go back out from
associational to
the normal representation.
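In the Coq sketch, the positional side might be a plain list of digits plus a weight function, and the conversion into associational form is a single map; the conversion back, which re-buckets terms into the fixed weights with carrying, is where the real arithmetic content lives and is omitted here:

```coq
(* Continuing the sketch: convert a positional digit list into
   associational form.  from_assoc, which places terms back into fixed
   weights with carrying, is omitted; a wrapped operation then has the
   shape
     from_assoc weight (reduce s c (mul (to_assoc weight xs)
                                        (to_assoc weight ys))). *)
Definition to_assoc (weight : nat -> Z) (digits : list Z) : assoc :=
  map (fun '(i, d) => (weight i, d))
      (combine (seq 0 (length digits)) digits).
```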
You might be worried
looking at this code
that we're going to have
run-time overhead of constantly
converting back and
forth between formats.
The idea is, for a given function that's present in the code in the end, we only convert once at the beginning and once at the end, and we're going to use partial evaluation to remove all the overhead of this conversion. Our proofs are made a lot easier by the fact that the reasoning in the middle here is straightforward; we also need to prove that these to-associational and from-associational operations are correct.
They have to do a little bit of
arithmetic to
choose the right way
to convert when you
have an input term
whose weights don't quite
match the ones
you're looking for.
But we do that once,
we prove that once,
and then the other code
and its proofs
follow this form and
are pretty
straightforward to verify.
So, in other words, let's say,
we have the example
of multiply or
some other arithmetic function,
it's parameterized over the bitwidths of the digits that come in and over the actual values of the digits, and then it produces the multiplication.
We think of the bitwidths as
a compile-time fixed parameter,
and the actual digits as
a run-time varying parameter.
So, we do specialization: just think of this as a curried function, so that passing it one of the arguments and not the others produces a Coq term that stands for the representation-specialized multiplication. And then we just reduce that with partial evaluation until it gets rid of all the overhead of the generality.
And Coq is a really nice platform for doing this, because the specialize stage is just partially applying a curried function, and the reduce stage is just calling standard term-reduction features that use lambda calculus rules to reduce until, suddenly, the code looks like assembly, more or less.
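To illustrate with the toy two-digit case in the Coq sketch (again, not the library's actual setup): specialize is just partial application, and reduce is Coq's own reduction with only the list machinery unfolded, leaving residual code over the unknown digits:

```coq
(* Continuing the sketch: partially evaluate mul at a fixed base,
   leaving the digits a, b, c, d as run-time inputs. *)
Definition mul_2digit := Eval cbv [mul flat_map map app] in
  fun a b c d : Z => mul [(1, a); (10, b)] [(1, c); (10, d)].

(* mul_2digit is now, roughly,
     fun a b c d => [(1 * 1, a * c); (1 * 10, a * d);
                     (10 * 1, b * c); (10 * 10, b * d)]
   with constant folding of the weights left to a later phase. *)
```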
So, not only are these steps easy in Coq, but, for those in the know about type theory based proof systems, they basically produce proof terms that are essentially just restating the problem. You don't need to do any explicit proof manipulations; these steps are built into the logic and automatically applied for you whenever you need them.
So, here's an example
of the end product
of that approach.
So I've defined
a particular base system.
This is the example I
gave of the popular prime
where you need to split it
into 20 different digits,
each based on alternating
between two
different bit widths,
and this function at
the top specifies that.
So there's a static Coq proof goal saying, "Please find me an inhabitant of this set, given two tuples of width 10." So apparently we only have 10; I guess this might be a simplified version of that example.
Give me F and G, which is another tuple of width 10, where, when I evaluate it in this choice of widths, we get out the same thing as if I evaluated the two inputs and multiplied those together in our particular modulus.
And then, drum roll please,
I run the framework
and out pops this code.
I probably should have introduced a bug on purpose and asked if anybody could catch it.
But this is pretty nice
flat-looking code that can
be turned into
C-like code by just
introducing names for
the intermediate computations.
And all these numbers
like 38 and 19
popped into existence
at the right spots,
and no new proofs needed to
be written for this to work,
but we do get
a guarantee that this is
a correct realization
of all that.
We have some Coq tactics that, given the statement of a proof goal, will automatically do what's needed to get to this point.
>> I'm a little confused about something. If you take a handwritten implementation of this kind of operation on this representation, every operation is guaranteed, carefully written, to stay in this particular representation with these 25-and-a-half-bit digits. It's not obvious to me that if you convert to associational form and then go back, you will also always stay in that form.
>> Got it.
So, the question is given
this framework
we've set up that is
so flexible about the width
of digits at different points,
how can we be sure that
the code we generate will
actually stay committed
to one representation?
The answer is the general
framework does not ensure that,
but we heuristically choose
the right instantiations of
its parameters so that
in practice this happens.
And so it's at least
true if you get
that heuristic choice wrong,
you don't get functionally
incorrect code out,
but you might get
really slow code.
So, you can even do
genetic search
through a space of
parameters and
benchmark everything,
and keep the fast ones so
that you can get
good performance
without baking into the formal side what good performance means.
>> So at one point,
you have each one of
these small variables
like F2, F9, and so
on that are bounded.
>> Oh, that's two slides from now. That's where we know the bounds, that none of these overflow, which is important.
>> You talked about formal performance. I gather that that's not at all what you're interested in right now. Can you imagine a scenario in which you'd be able to prove some bound on performance, at least given a limited bag of tricks, or maybe not limiting your bag?
>> So, could we someday
prove performance results?
It seems very believable
given that we're basically
doing straight line code here.
How hard can it be?
Optimality, can we prove that
there is no better
implementation of
this algorithm on
this processor?
I have thought about
doing that because
we also work on
processor verification.
So, we can look at
the Verilog code
for the processor and
try to prove that.
But I have no concrete ideas
on what the proof
strategy would be.
Sounds like fun though.
>> This is not necessarily the optimal algorithm, Karatsuba and stuff.
>> Right. We do have Karatsuba in our library, but it's not being used here, I think, yeah. I think Karatsuba turned out to be not so bad in this funky representation, but I don't remember the details.
>> Another question, when
you pick these heuristics,
do you expect to get out
the code that corresponds
to a handwritten implementation
that you're trying to match?
>> Pretty much,
yeah. The question
is will we expect to get
the same code that the experts
are writing by hand.
What tends to happen is it's not literally the same code, but it's a one-line proof in Coq to establish that it's the same code. The instructions are in different orders, which does have consequences for register allocation and that sort of stuff. It's still an open problem how to automatically make those choices in a very efficient way.
There are no deep
differences in our code
versus the handwritten code.
All right.
There is one other catch
that shows up in
partial evaluation,
which this example is
too simple to need.
Sometimes we have
common subterms
within a big term like this.
Not so many here because
sort of by construction,
we're always getting
these cross products
of different terms of the input,
but sometimes we need to
preserve term structure.
So, let me give an example,
here I'm just trying
to motivate the idea.
Sometimes we want to generate code with lets in it, and have those be preserved throughout the evaluation down to flat code. So, imagine the lets in this simple example were binding variables that needed to be mentioned multiple times; though in this case, to keep it straightforward, I'm only having each let be mentioned once. This is a simplified case of an addition algorithm that just component-wise adds the digits between two inputs.
And by the way,
you might notice
this simple code
doesn't do carrying
across digits,
but actually
many real algorithms
set things up very carefully, so that you don't need to do carrying with every add; only after a few adds have you accumulated enough extra bits in your registers that you then do some work.
But imagine we're trying to add
two lists using this function.
And now, I want
to specialize this code to
the case of three digits.
Between the two inputs, there are six values that are not known at compile time.
And I just tell Coq,
evaluate this functional program
in the usual way.
Unfortunately, we get this kind of ugly thing, which doesn't look like assembly: the lets are nested inside of the list cons operations.
The way we get around that is unfortunate, but it works well in the end.
We write most of our code in
continuation passing style.
So here's the CPS version
of add lists,
where say the recursive call
is in tail position and we
build a new continuation,
and standard stuff like that.
And then, when we try to do
the same kind of reduction on
the lists of length three
with unknown values,
then it just reduces
to the right point.
So, it looks really
straightforward here.
It's kind of a pain in our code base: we wrote everything in both CPS and direct style and proved equivalence, and the actual normalization only happens in CPS, to get proper sharing of common subterms. It took us a long time to figure this out, by the way.
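Here is a hedged toy version in Coq of the CPS shape (not the library's actual code): a pairwise add in CPS, then reduction with the let-inlining (zeta) rule switched off, which leaves all the lets at the front:

```coq
(* A toy CPS pairwise add, assuming equal-length inputs; the
   continuation receives the output digits. *)
Fixpoint add_digits_cps (xs ys : list Z)
                        (k : list Z -> list Z) : list Z :=
  match xs, ys with
  | x :: xs', y :: ys' =>
      let s := x + y in
      add_digits_cps xs' ys' (fun zs => k (s :: zs))
  | _, _ => k nil
  end.

(* Reducing without zeta keeps the lets; the result is, roughly,
     fun a b c d e f => let s := a + d in let s0 := b + e in
                        let s1 := c + f in [s; s0; s1]. *)
Eval cbv beta iota delta [add_digits_cps] in
  fun a b c d e f : Z =>
    add_digits_cps [a; b; c] [d; e; f] (fun zs => zs).
```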
>> I'm sorry, why does x plus y not get inlined in the continuation?
>> All right, because we've told Coq not to do the let-inlining rule of reduction, but to do all the others, or an appropriate set of others.
>> But you could've done that in the first one.
How does the
continuation help you?
>> Yeah, it's kind of magic.
I don't have a great
description of why.
I think this trick is
sort of well-known in
the partial
evaluation world, but-
>> You can see the let before you go to the continuation in the second version, whereas in the first version, you do the cons and then you address the [inaudible].
>> By the way, I forgot to repeat the question, which was basically, why does this help? I think that was the question.
>> Yeah.
>> And then Clément partly answered the question. Anyway.
>> You can just prove that in the top version you're going to get exactly what you saw on the right. You only have conses and lets at the beginning of the body, and a few after the [inaudible].
>> Right.
So, okay. I guess a short explanation of why this works is that, with the natural reduction rules, we do all the lets before we start building the list, and that's what makes this work.
And by the way, you might
worry, hey there's still
a list left in this program.
It's not a very
idiomatic C program,
but we'll actually take
this operation and compose
it with others in practice,
and all the lists will reduce
away in a larger program.
>> Is it [inaudible] to say that the top version of the program would require Coq to do much more work [inaudible] hoisting bindings, and then give you the shape you want, with the obvious lists in terminal position of your program, to further reduce when used in others?
>> I think the answer is yes. The question was whether getting the top version to work nicely in Coq would require Coq to do some work to hoist lets for us.
We spent a month or
more trying to get
that to work and it just
wasn't worth the effort.
So that's why we
switched to CPS.
We were trying to write a proof-by-reflection tactic that does that for us, but this is much nicer once you get past writing everything in CPS and proving it once.
All right.
So, one other dimension
of complication I
should mention is,
there are two major
implementation strategies for
arithmetic in this world called
unsaturated and
saturated arithmetic.
In unsaturated, you intentionally put digits in registers that have bits to spare beyond what you'll actually try to fit in there, and then you have to periodically use carrying to reduce things down to fit again. So our library, to support this strategy, needs to be parameterized not just on digit division strategies, but also on when to carry and between which digits.
So that's another
parameter when we
prove correctness
for any such choice.
Though, key point here,
we don't prove lack of
overflows once and for all.
That's going to be
on the next slide
that comes later in compilation.
And then the alternative is saturated arithmetic, where we fully utilize the register bitwidths at all times. Yeah?
>> How do you deal with reasoning about intrinsics, like you mentioned earlier, for hardware instructions? Do you have to reason about other instructions that eventually clear the flag? Do you have to reason about interrupts and signals, or assume that the OS always restores every single flag exactly as before?
>> Okay.
>> It seems like it leads into a very low-level kind of reasoning that might be intriguing. I don't know, I'm just curious.
>> So the question
is, when we depend
on compiler intrinsics for
operations like add with carry,
do we have to worry
that our program gets
interrupted between
that instruction
and a later one that
uses the carry bit?
>> I think we assume the compiler understands its intrinsics well enough to avoid that. The carry goes into a regular C variable, and so the usual guarantees should apply there.
>> Because we don't poke at the processor flags directly.
>> The OS will just restore the [inaudible] because it's transparent for us, but-
>> Well, it is worth
mentioning that
the popular C compilers often do
a terrible job of
optimizing programs
that use carry flags explicitly.
They like to spill
the carry flags
into memory all the time.
It's crazy from the standpoint
of the assembly code
you'd naturally write.
So, yeah, but that is a compiler problem, even independently of formal methods, that would be nice to have solved in a uniform way.
Right.
Yeah.
I probably said
everything for this slide.
Okay. So let's actually
start discharging
this obligation about overflow
and making sure
we don't have it.
So imagine in our earlier stages we compiled this expression, which is a somewhat optimized version of some sort of program over variables x and y. And what we do at this point in our pipeline is, we reify this to a syntax tree in a type defined inside of Coq, with a formal semantics.
And then we do
some constant folding
to reduce it as much as
possible while we still have
this nice algebraic style
of expression.
Then we flatten it into a form where we introduce a name for every intermediate computation.
That's roughly the level
we were looking at before.
And by the way,
there might have already been
some lets in the starting point,
but I'm just assuming
we didn't have any here.
And now, we're going to infer bounds; this is just abstract interpretation with intervals, with a few twists.
So we assume bounds for
the all the free variables
in the term and
we're just going to
push those through
all the operations and
infer bounds for
all the intermediate terms.
And then, based on the bounds we've inferred, we can see that c needs at least 52 bits; the upper bound is adding a few powers of 2, but that's a relief: it's still less than 2 to the 64. And a similar sort of thing happens for d. So we can put both of these in 64-bit integers.
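As a minimal sketch of the interval idea (not the library's actual analysis, which also handles negative ranges and bitwise operators):

```coq
(* A toy interval domain: push ranges through operations and check
   that results still fit in 64 bits.  rmul is only sound when both
   lower bounds are nonnegative. *)
Require Import ZArith.
Open Scope Z_scope.

Record range : Type := { low : Z; high : Z }.

Definition radd (a b : range) : range :=
  {| low := low a + low b; high := high a + high b |}.

Definition rmul (a b : range) : range :=
  {| low := low a * low b; high := high a * high b |}.

Definition fits64 (r : range) : bool :=
  (0 <=? low r) && (high r <? 2 ^ 64).
```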
And the main pain for us in implementing this analysis was proving bounds for bitwise operators on negative numbers, for what it's worth; we had to do something sound there. We do need bitwise operators on negative numbers for popular algorithms to get the best performance.
So we do this analysis and it pops out some code. And by the way, we're generating ASTs in Coq for this final phase, but we want to pretty-print to C. And the way we do that is insane: one of the students working on this project wrote a script that registers something like a thousand different Coq notations so that Coq itself pretty-prints its own code as C. It's very black-magicky.
And then we feed that
to normal C compilers
to finish the process.
So, other parts that I don't have time to go into in more detail: remember, early on I said we start out with two-dimensional points for elliptic curves, and we reason about replacing them with four-dimensional points, and we have other examples that are even fancier.
We built some new Coq tactics
to justify all that,
and we proved all the
popular point formats
something like 10 of them.
And they're mostly automatic proofs, after a few months spent on basically a tactic that does things computer algebra systems do, but in a proof-generating way.
And we have a whole menu
of arithmetic
operations that are in
this library including finding
inverses, exponentiation,
square roots, and going out of
our strange encodings back
into standard wire
format encodings.
And several of these have their own dimensions of choices from small menus. For operations like exponentiation or square roots in particular, there are like two different recipes for square roots depending on the number-theoretic properties of your prime.
So we define both of those and
you can pick which one
you want without
having to worry about breaking
the soundness. All right.
So how did we evaluate
the actual performance
effectiveness of this approach?
Well, first, I'll
say we have about
38,000 lines of code
in our full library.
About half of that is
generic math stuff that would
fit in the Coq standard
library conceptually.
Very little work, very little code, is needed to instantiate to a new prime modulus,
and I'll show an experiment
justifying that.
In particular, we wrote a Python script, under 3,000 lines of code, that examines your prime number and makes its guess about the best parameters.
And importantly,
the prime number needs to
be written in a
suggestive form like this,
where the actual way
you break it into
additions and subtractions
with powers of two,
influences how the
heuristic works.
But this is the way crypto
implementation people are
used to writing things,
so that they can
sort of look into it
and see the essence of
the right optimizations to use.
And this script that generates
the parameters is not trusted.
The worst thing that
happens if it's wrong is
the compilation might time out.
If compilation finishes,
you're guaranteed
to get correct code.
Might not be efficient code,
but you can auto tune with
many different parameters
and pick the ones that
finish and actually give
you performant code.
So, we did a little bit of that.
We needed to get a list of
different moduli to
try for arithmetic.
So what we did was,
we scraped the most popular
mailing list for
elliptic curves,
and we processed the text
and found everything
that was a prime number.
And that gave us about 80
different prime numbers.
And luckily, only a few of them
turned out to be
complete nonsense posted
by people who were new to
the mailing list and didn't
understand elliptic curves.
But those would time out when we tried to compile them.
There were maybe like three
of those or something.
So then, we automatically built
both 64-bit and
32-bit versions for
all these primes as well
as building versions for
the two main
arithmetic strategies:
saturated and unsaturated.
And then, for each platform, we would just keep the faster one.
And so that leads
to these results.
Here are the ones for
64-bit arithmetic.
It was kind of hard to figure out what we were benchmarking against as a baseline, because most prime field arithmetic is specialized to particular algorithms. What we decided to do is take the GNU Multiple Precision integer library, GMP, which is probably the most popular generic big-number C library out there, and see how much faster you can get your code to be by using our library versus using this off-the-shelf C library.
Where we could, we'd recompile the C code with a preprocessor macro set to the prime number, but otherwise it was mostly run-time specialization to different prime numbers. This library, in the last few years, I think did start doing compile-time code generation to specialize things to the modulus, so it does take advantage of that.
So the bottom line
here is what we get.
The x-axis is the number of
bits in the prime modulus.
And then these two
other curves here
for two versions
of the GMP library.
One, where we say you must produce constant-time code that does not leak secret data values through timing,
we say, "All right.
Feel free to use optimizations
that break constant time."
Which is not okay for crypto,
but which gives that library
an inherent advantage.
And you can see, even
for this bottom curve,
which is allowed to
break constant time,
we were ranging from
between about one and a half
times faster to 10 times
faster, in some cases.
And this is
completely automatic.
A script scrapes all the primes
and does this for you.
So no new coding work,
no new proof work.
And also, we ran the same thing for 32-bit code, which we ran on an Android phone, and we get pretty similar kinds of results, going up to a neighborhood of 10 times faster for many of these curves compared to GMP.
>> So what the graph seems to indicate is that GMP is smart only when the modulus is about a multiple of the word size, right? Like around [inaudible] et cetera.
>> That sounds right, but then also, apparently, we're seeing... So the suggestion was, we look for points on the x-axis that are multiples of the base word size. The interesting thing is, let's see, you're basically pointing out these discontinuities out there.
>> One of them is about 512.
>> Okay. Yeah, that can be true.
What stands out for me are
these points where suddenly
our code does a lot worse.
I'm not sure what's
happening there.
There's another question?
>> Oh, yeah.
I was just wondering
why in the bottom one
the non-constant times
are slower.
>> Yeah, that's kind of weird.
We're not sure why that is.
It's probably just, so
the question is why are
the non-constant time ones
often faster or slower than
the constant time ones?
Probably just because GMP
isn't tested on
Android phones too much.
We had to work hard enough to get it to build that I can sort of believe that isn't happening much. We couldn't get the C++ version to build, for instance, within our patience budget.
All right. So this is,
like, all the primes we
found on this mailing list.
There are also a few primes that are really important in practice,
where we wanted to compare
against not just GMP
but also the best
assembly code out there.
So here is part of the
Curve25519 codebase you'd need
for TLS 1.3 for
the preferred cipher suites
at least. Here's our code.
I'll explain why there
are two lines for it.
We're basically listing
these different
implementations by
how many CPU cycles they
take and our code
is doing all right.
The generated C, which is
the most literal
interpretation of
the pipeline I've
been talking about,
was something like 20% off from
the best assembly code
to start out with.
Now, what we did to try
to understand what we were
missing to be faster
is we looked at,
I think it was this C program,
which is the best
known C program.
Or maybe we looked
directly at the assembly,
I'm not sure, and then,
took the code that someone
else had written and translated
into CalK in the same form
as a functional program,
and then proved that that form
is equal to ours extentionally.
And so that got us,
like, 10% better performance.
And our conclusion
from all that is that,
what we're really missing is
good instruction scheduling
and register allocation,
which the experts are doing
by hand as they write this code.
So that's one of our directions
for future work to
build a compiler that does
those in the right way.
It turns out Clang and GCC
actually do really bad
register allocation
for this kind of code.
And you can get
a factor of two in
performance in many cases by
handwriting the
assembly code and doing
instruction scheduling and
register allocation yourself.
But it would be nice to
have a compiler for that.
We also measured performance
of our P256 implementation.
This is the most common
elliptic curve in TLS,
significantly
harder to implement
efficiently on
commodity processors,
which is why the other one
was invented.
But here, you can see
we're like, roughly,
twice the number of CPU cycles
versus the best
known assembly code.
And on this slide
and the previous one,
like everyone else who
works in the space,
we at least beat
the performance of
OpenSSL cross-platform C code,
which sounds impressive
at first until you realize
no one uses this code in
practice because it's too slow.
But we're at least reasonable
against that baseline.
All right.
So, the next steps we want to tackle here:
one I was just mentioning is,
some of the last phases
of compilation
really deserve
good compiler support
even if we forget about proofs.
Namely, we want better register
allocation and
instruction scheduling.
So we don't have to hand-modify
the last stages of what
pops out of our compiler.
And we think it's okay to write
some sort of really
expensive, maybe,
random search-based version
of this that you let
run overnight for your
important crypto-algorithms.
We'd like to expand our scope
to more of the crypto
primitives from
TLS and other protocols
and, in general,
try to do more of the parts of TLS inside Coq with code derivation-based strategies, as opposed to write-the-code-and-verify-it strategies.
And there also seem to be
good matches with
other kinds of crypto,
especially emerging kinds like
lattice-based cryptography
or, apparently, there's still a lot of experimentation about how we should use lattices in crypto primitives.
So it would be nice
to have a compiler
that takes each
such new idea and
automatically builds fast code
for all the platforms
that are out there.
And I think I'll stop
here with our GitHub URL,
where you can download
this and give it a try.
Thanks.
>> So why do you wanna
search for a good
register allocation?
Presumably those
rock star coders are
not running searches over
spaces in their heads.
They have heuristics in mind
and they're applying
those heuristics.
Presumably, if you talk
to them and learn
these heuristics,
you apply them just like
you're applying
other heuristics.
>> We tried that
a little bit so far.
Well, maybe we should try more.
The question was, why
don't we just talk to
the people who are handwriting
the code to get the heuristics
out of their brains
that we've put into
compilers for the register
allocation and similar stuff.
I don't think the experts
have the level of
introspective knowledge to just
tell us what the algorithm is.
We haven't been successful yet
at extracting it but
we'll keep trying.
At the moment I'm
betting more on
random search based approaches
being effective.
We'll see. Yeah.
>> Can you say a word
about vector instructions?
>> Vector instructions
have been on our list to
support since the start.
We don't know how
it'll work exactly.
We'd like to do some
manipulations at the level of
functional programs that find
appropriate units
to group together,
but no concrete insights
to share at this time.
It is important to get
the best performance on
particular processors but we
haven't gotten there yet.
>> Do you have any insights as to why the Intel compiler is so much faster than the GCC and LLVM compilers?
>> I think it might be
handling of Carry flags.
GCC likes to spill
Carry flags to
the stack and the Intel
compiler is reasonable.
>> So there are really two things going on here: the verified part, which, of course, I think is very impressive, and the automation part, where you're doing things automatically that people are doing manually.
So I'm a bit curious. When people do this manually, and I think they do it manually, why don't they run this kind of program to get [inaudible]?
>> So the question is,
why weren't the experts
already scripting the work
that we're doing automatically?
Even without the proofs,
it's still a big advantage.
And the reason is
they didn't think it
could be done, I guess.
When we talk to
the people who've
designed all these protocols,
these primitives, they're pretty
surprised that you
can generalize.
It's just a functional
program that
embodies all the algorithms
they've been figuring out.
>> They can do that. It seems completely conceivable that you could do much better than they do, just saying that you can unroll the loop ten times and do this over ten iterations.
But they can't do it manually
because it's too much work.
>> Well, they already do
that, even in OpenSSL,
and Vale does
that kind of unrolling.
>> [inaudible] unrolling since
the loop thing then
they can't do it.
At some point they're going
to beat the [inaudible].
>> Well, they just
write scripts that do
the loop unrolling for them.
>> But they have scripts.
>> That part works.
But as far as I know,
there are no scripts prior
to ours that are generalized
in either the Prime modulus
or the target
hardware architecture.
They get rewritten for each
one, or at least there's significant implementation effort for each new variation.
>> So it seems like they could, at some point, get the automation aspect too, and do it faster.
>> I think maybe there's
these two points.
Maybe it's because
you have proofs,
you are emboldened to
actually search for the stuff.
Because if you just wrote
a script and you just,
like, here is
the implementation,
and it happens to be faster than
the other one on
some test vector,
how do you know
that it's actually
a reasonable implementation?
So I think that the proofs
are an enabler here.
The other point is, I think, HACL* is also doing this kind of metaprogramming to generate implementations that are parameterized by a template for the representation of the bignum. I think what you're doing that's maybe more than what HACL* is doing is that HACL* derives all the operations except the modular reduction generically via metaprogramming.
And then you have to provide
just the modular reduction
and you do the modular
reduction as well.
But in HACL*'s defense, I think they ensure that they always stay in the particular template that was provided, whereas, given the discussion that we had, I think yours, in the intermediate states, may actually depart from the original template that the implementer may have had in mind.
>> Right. So just to repeat
both the previous remarks for
people following along at
home and this other remark.
First one was having proofs of
correctness emboldens
the programmer to try
new generation strategies and
maybe people who didn't have
proofs in the bad
old days didn't
try these strategies
for that reason.
The second remark was that the HACL* library is generic over the prime modulus, except in the modular reduction code, and is set up to guarantee, in the appropriate sense, that you stick to the representation that the implementer would want.
And again, I'll just mention that all the cases where you're outside that representation only occur at compile time and get reduced away to cases where you do stay inside that representation, as long as you chose the right parameters; but you might not have.
>> Are you planning
to go to assembly?
>> The question is, are we
planning to go to assembly?
Yes. We are writing
a new verified compiler for
a C-like language that we
are planning to
connect into this.
It's pretty easy to
do that for this code
because it's very close
to assembly already,
but we want to expand
the scope to include
loops and other things
that some of those related
projects have already tackled.
Yes.
>> But why do you need
the mailing list to
search for primes?
You could search for
primes yourself, and maybe find better ones than anyone has even thought of.
>> We can choose
random parameters to
cryptographic algorithms
and hope we hit
on something useful. Yes.
>> Ones that are close to.
If it was up to you, you
don't think [inaudible].
>> I don't actually know
what goes into deciding
which one of these primes
is the right one to use.
But I think it can be a huge mistake to just find the ones where the right operations are fast, because you want cracking your cryptography to be slow.
>> You have to know which ones aren't safe? I mean, there are algorithms for determining whether they're prime, but there are other traps and [inaudible].
>> The hard part
here is you need
to find parameters where
the right operations are
fast and where certain other
operations are really slow.
And it's hard to use
testing to ensure
things are really slow.
>> What were the bad primes that [inaudible] were posted on the list? What was bad about them?
>> They just didn't have the right... The question is what made certain primes posted on the mailing list bad.
And it was that they didn't have
the right structure for
these optimizations to work.
>> Was it far away from a power of two or something?
>> Probably. I'm not sure.
Did you have a thought?
>> That's what I wanted to say.
>> Okay. And it's also the case
that sometimes the same prime
can be usefully expressed in
multiple different formulas.
And one formula is
best for 64-bit,
one formula is best for 32-bit.
And so we had to use that to
get the right behavior
in some cases.
>> All right. Well,
let's thank Adam.
>> Thanks.
