>> Welcome, everyone, and good morning. It's my pleasure to welcome Fredrik Kjolstad back to MSR. He is a grad student at MIT, working with Saman Amarasinghe, in the field of compilers, compiler optimization, and high-performance computing. He has been working on linear algebra for dense and sparse inputs, and he's going to talk about his recent compiler for tensor algebra.
>> Thanks, Seth. So about a year ago, I came to talk about a programming language we have called Simit, which is for computing on graphs, or sparse systems, using linear algebra. As part of the future work, I talked about the need for a strong compiler for sparse linear algebra and tensor algebra, especially when the matrices and vectors are blocked. I'm back today to talk about that work. We have done it in the last year, and we also made it work for tensor algebra, generalized tensor algebra, which is really neat. And it handles both dense and sparse tensor algebra. So, I'm going to present the theory, which we implemented in a compiler called the Tensor Algebra Compiler, abbreviated TACO.
This is joint work with Stephen Chou and Saman Amarasinghe, who are both at MIT, Shoaib Kamil, who is at Adobe, and David Lugato, who is at the French Alternative Energies and Atomic Energy Commission.
So, there are a lot of tensors around everywhere these days. You will find them in data analytics, where you have things like the Netflix movie ratings, the Amazon product reviews, or the social interactions on Facebook. A tensor is a generalization of a matrix to any number of dimensions. So, a vector would be a one-tensor, a matrix would be a two-tensor, and then you could have three-tensors like this, or four-tensors, and so on.
You also find them in machine learning these days, where both the images and the other inputs can be stored as tensors, and the operations can be tensor operations. So applying a convolutional layer can be a tensor operation. Applying a fully connected layer can be a dense tensor operation, or it can be a sparse tensor operation if you are talking about sparse neural networks.
But you also find a lot of them in science and engineering. Tensors made it possible for Einstein to do his general relativity, and they are used all over the place in quantum chromodynamics, in quantum mechanics in general, and elsewhere. Here, I have the finite element method; the key there is that a lot of it is matrices, and sometimes you use tensors in these kinds of structural mechanics. But since we are talking about generalizations of matrices, we also include linear algebra. So, everything I'll be talking about applies to sparse linear algebra as well. We support that too.
So, let's look at this Amazon product review tensor. In the first dimension, this tensor has users. There are many users; it is a huge tensor. It's a three-tensor, so it has a second dimension, and that would be products, the different products that users bought. And in the third dimension, you have the words in the product reviews. Prior work has shown that if you do a factorization of this tensor, you can learn something about what people are doing.
This tensor is extremely sparse. The whole tensor contains 15 quintillion entries, almost all zeros, and that's enough that you would need about 10 percent of all the disk space in the world to store it uncompressed. But it only takes 1.6 gigabytes, because it only has about 1.7 billion non-zeros. So for each non-zero, you have about 10 billion zeros. There's no way you can store this unless you compress it. So, you have to take advantage of the sparsity.
This kind of extreme sparsity is quite common in data analytics and in science and engineering. It's all over the place. We're also starting to see some work on sparse neural networks, so you have some sparsity there. They haven't gotten to this level of sparsity yet, but perhaps they should, because the human brain has about 100 billion neurons, and each neuron is only connected to about 7,000 other neurons. That's an extremely sparse system. So if you can model something like that, we could have this kind of sparsity there.
So, one question is why we need a tensor algebra compiler, and I'm going to argue that we do. Consider hand-writing the different expressions that you would need if you are working with linear algebra or tensor algebra; the current approach with traditional libraries is that we hand-code them. So, this is a very simple expression: matrix-vector multiplication. It's sparse matrix-vector multiplication, so B there is sparse. The Eigen library implemented this operation; they hand-coded it. Then the CSparse library implemented a different kernel that, in one kernel, multiplies B by c and then accumulates into the output. And they implemented this kernel because sometimes you get better performance by doing the whole operation in one go.
>> When you talk about sparsity here, is it specifically the matrices that are sparse, and not the vectors?
>> In this case, yes. That will come; the vectors will be sparse too. But right now it's only the matrices, that's correct.
So there are two different kernels that people hand-coded. Then you have PETSc, which figured out that you can add a different vector to the result. OSKI figured out that sometimes you get better performance if you also scale both sides. Then PETSc learned that you can also transpose this matrix on the fly, and sometimes you want to transpose the matrix and add the vector. So someone wrote all these kernels, because in different contexts, different kernels give the best performance. Binary expressions are not enough, because you lose performance due to the loss of temporal locality.
But then you have different sparse formats for the matrix. These sparse formats can be blocked. They can be variable-size blocked if you're dealing with something like generalized coordinates. And they can be doubly compressed. So now you have 132 different variants of this operation that someone has to go and hand-code. Then the vector can be dense or sparse, which adds more variants; both the output and the inputs, and the vector you're multiplying by, can be sparse.
And then you have the rest of the tensor algebra expressions, and there's an infinite number of them. Even if you only consider the binary operations, since a tensor can be any order, you have an infinite number of binary operations too. But I'll show very clearly later that if you only consider binary operations, you are losing a lot of performance, sometimes asymptotically.
So, we have a tensor algebra compiler called TACO. The alternative to such a compiler is a traditional library where you hand-code these kernels. With that kind of approach, you have to choose a few expressions, traditionally the most important ones. With the compiler, you get any expression. With a traditional library, you choose a few formats that you want to implement, typically CSR and CSC, probably both. With the tensor algebra compiler, you get many formats. With the traditional approach, you spend years hand-optimizing these kernels; this guy has spent many years. With the compiler, you can relax, because the compiler will do it for you. This cat has a compiler.
For the rest of this talk, I'm going to talk about sparse code and data, to get a sense for sparsity and what the code looks like when the operands are sparse. Then we're going to talk about the intermediate representation and code generation, what we do to automatically generate this code. Then we're going to evaluate. But first, let's look at some sparse code and data.
This is a fairly simple tensor algebra expression. You have a tensor B that you're multiplying by a vector c, summing over the k dimension. So, it's like a matrix-vector multiplication with one more dimension, and the result is a matrix. The variable k is called a summation variable, because you sum over it. And i and j are called free variables, because you don't sum over them; they are used on the left-hand side of the expression.
So if all three of these tensors are dense, you have a dense matrix, a dense three-tensor, and a dense vector, and then the code is quite straightforward. We can write this code easily: you get three nested loops, and you do the multiplication and accumulation in the center of those three loops.
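To make that concrete, here is a minimal sketch of that dense kernel (assumed array names and row-major layout; not TACO's exact output):

```c
/* Dense kernel sketch for A(i,j) = sum_k B(i,j,k) * c(k).
   I, J, K are dimension sizes; all arrays are dense and row-major. */
for (int i = 0; i < I; i++) {
  for (int j = 0; j < J; j++) {
    double sum = 0.0;
    for (int k = 0; k < K; k++) {
      sum += B[(i * J + j) * K + k] * c[k];
    }
    A[i * J + j] = sum;
  }
}
```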
But then you want to start making these tensors sparse. In our system, you can choose dense or sparse per dimension. For each dimension you say: this dimension is sparse, or this dimension is dense. That gives you many combinations of formats; I think there are 16 combinations for this one and 8 combinations for that one.
So if I make the i dimension of the three-tensor B sparse, then the code changes a little bit there. Instead of iterating over the whole domain, it iterates only over the sparse data structure that stores that dimension of B. If I make the next dimension sparse, it changes the same way. If I make the third dimension sparse, it changes the same way.
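A hedged sketch of what one of those sparse loops looks like, with assumed pos/idx array names in the style of compressed formats:

```c
/* Iterating one sparse dimension of B (assumed array names): instead of
   looping over the whole range 0..I, loop over the stored entries only. */
for (int pB1 = B1_pos[0]; pB1 < B1_pos[1]; pB1++) {
  int i = B1_idx[pB1];  /* the coordinate stored at this position */
  /* ... code for the next dimension, at position pB1, goes here ... */
}
```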
But if I want to make the vector sparse, then I have to insert quite a bit of code to merge this sparse vector with that sparse tensor in the k dimension. That looks very similar to a merge join from databases, if you're doing an inner join. And if you have an addition, it looks a lot like the two-way merge that Donald Knuth describes in his book, The Art of Computer Programming; it's the merge used in merge sort. We generate this kind of code, and we can generate much more general code than this, and I'll show that.
But suppose what you want to do is this: you have two sparse tensors, you want to add them, and you want to multiply the result by a vector in the same kernel. Then the code starts to blow up, and you can't write this code by hand anymore, because of all the cases you have to consider. I'll show a quite straightforward theory for how to generate these cases.
>> I have a question about when you say a dimension is sparse. Does the order matter here? Like the order of the dimensions that you're talking about.
>> Yeah, I left that out. You have to specify both: whether each dimension is dense or sparse, and which order to store the dimensions in.
>> For example, for your tensor, if the first dimension is sparse, it means that an entire matrix slice can be gone. If something is missing from the second dimension, it means that a row is gone.
>> Yeah.
>> And then if you were to cut-
>> Yeah, exactly what you said.
If this first dimension is sparse, then this whole slice just disappears, and so on down. In our system, you not only specify this, you also specify the order you store your dimensions in. So with a matrix, you can store it in column order, which gives CSC, compressed sparse column, or in row order, which gives CSR, compressed sparse row. You can specify both things; I just left it out for simplicity. Okay, so that's how sparse code looks. Yes?
>> I'm curious. Obviously, this problem has existed for many years, right? And people have worked on it. So what is it that's different now, compared to the people in the 90's who were working on Fortran and trying to do the same kind of thing? I'm just curious how the problem has changed over the course of the last decades.
>> I don't think the problem has changed. I think it's just that we found a new angle of attack on it. A lot of people worked on the dense problems, like the polyhedral model and things like that, and they've made a lot of progress there. Some people, like the Bernoulli project of Keshav Pingali, and some work from people in Amsterdam, tried to tackle the sparse problem, but they got stuck. I don't know exactly why they didn't solve the problem. I think they got stuck in the complexity, and they only did it for linear algebra. I think maybe that was the problem, because if you only do it for linear algebra, you don't see the generality. You have to go to higher dimensions to see the generality. It's just a human thing.
>> Okay.
>> And I think they got stuck in the complexity of the whole problem. But we figured out how to generate this kind of code level by level. We do it one dimension at a time, and then you can solve the problem locally in that dimension, and it becomes a manageable problem. You still have to deal with the merging; you have to solve that too. I think they tried to generate the code for the whole thing at once.
>> I see, and then got stranded in the complexity. It is too complex.
>> Yeah. I think maybe what changed about the problem is that we thought about it as tensors, and because that's more general, you see the generality.
>> You started with the more general representation.
>> We actually started with just linear algebra, and the theory got simpler when we generalized it to tensors. Which is kind of cool.
>> Yeah.
>> The most closely related prior work is Keshav Pingali's Bernoulli work from the 90's; he was at Cornell at the time. So that was sparse and dense code.
Now we're going to look at sparse and dense data. This is the same three-tensor, but unrolled; the three matrix slices are shown here. If this is a dense format, I lay it out in memory linearly, and notice that the coordinates form a complete tree. That is important, because since they form a complete tree, I can keep them implicit and still locate any element. I have random access: I can locate an element by just computing a strided formula, and I get straight down to the element. So I have random access; that's an important property.
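For a dense I by J by K tensor stored in row-major order, that strided formula works out to:

```latex
\[
  \mathrm{loc}\bigl(B(i,j,k)\bigr) = (i \cdot J + j) \cdot K + k
\]
```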
But I waste a lot of space, because all these entries are empty, and I don't want to store them. So I want a compressed storage format. What do I do? I remove all the zeros and shove all the non-zeros next to each other. Now I don't have a complete tree anymore, so the coordinates can't be kept implicit; we have to store them ourselves. So I start by storing the indices in the i dimension. Then, for the j dimension, I store, for each i, which j's correspond to that i. Then, for the k's, I store, for each (i, j) combination, which k's correspond to it. And that gives me quite a general format. The compressed sparse row format for matrices has one of these sets of indices; if it were doubly compressed sparse row, you would have two sets. With tensors you just keep going, adding one level on top of another.
This is called the compressed sparse fiber format. We can generate code for this, and we can generate code with some of the dimensions dense as well, in any combination. And then, for each of these coordinates, you have the value that corresponds to it, just by position. You don't have random access anymore; you lost that. But typically when you compute, if you merge things together, you don't need random access; you just scan over the whole format.
I'll show that. If you want to locate the same value we located earlier, you have to scan across that row: you consider all the values in the row, scanning across the j's until you find j0, then you consider all the k's for that j0 and scan until you find it. So you lose random access. You can still do the lookup, but it's expensive. And what this format does is create a dependency chain through your index variables, and those index variables are going to correspond to your loops. So now you have dependencies between your loops. This is a key observation, and we call the result an iteration graph. I will show how it works in general.
Now I'm going to go into the intermediate representation and code generation, which is our compiler. We call it TACO; that's an acronym for Tensor Algebra COmpiler. The way it works is that it takes in a tensor expression, any tensor expression, with any number of operands. And for each operand, you say the format of that operand. That's all you say.
Then it produces C code: a kernel that computes that expression in those formats. We also have a library, so we can load data and all of those things as well. Internally, it turns your expression into an iteration graph. Then, from the iteration graph, it produces code, and when it produces code, it uses a concept called a merge lattice. That's a new concept we developed to deal with these arbitrary merges, because you don't only want to deal with binary merges; you want to deal with merging three operands, four operands, any number of operands.
So we are going to look at how the iteration space looks, and we're going to build up to the iteration graph and see where it comes from. It comes from the formats, but it also comes from your expression. If you want to compute this expression, you have to iterate over all combinations of i, j, and k. So it's going to be three loops, and you iterate over that whole expression. You can think of that as three dimensions of a polyhedron.
So now we are moving towards the polyhedral model. At each location in this polyhedron, you are going to access B, so you are accessing B through this polyhedron; it is a three-dimensional rectangle here. But then you have to access c as well, and since c is only one-dimensional, it's only one of these, but you're going to access it for each (i, j). So you can think of it as replicating c across the whole iteration space. Then, when you do the computation, you have to merge these together and compute the multiplication.
So this is the polyhedral model. At the heart of it you have these fine polyhedra, but they have no holes; they're dense polyhedra, and the compiler community has spent many years figuring out how to compile stuff like this, with a lot of success, in the 80s, 90s, and 2000s. But applications have been moving towards sparse representations, and the compiler community has to catch up and give them some support. We can think of the sparse case as polyhedra with holes: now we have a sparse polyhedron. This creates some opportunities and some problems that we have to solve.
We don't know where these holes are at compile time, but we cannot touch them, because, as I mentioned with the Amazon tensor, if you touch all these holes, you will never finish. So you have to generate code that avoids touching the holes but computes the values where you have non-zeros. So I put back the dependency chain through the iteration variables that corresponds to this tensor.
You're going to have a similar dependency chain for the c vector. So there's a dependency chain here for B, and a dependency chain for the c vector. When I merge them together, I merge these sparse polyhedra, and I also merge together the two iteration graphs. So now we have this dependency chain, and you have a merge here; there's a collision there. If I want to do the multiplication, I have to intersect the two sparse polyhedra and then compute over the intersection; that's the multiplication. Then finally I add in A, which is the result; there's a dependency chain on the result too.
>> So, what this iteration graph shows is the dependencies imposed by the sparse data structures, and we're going to use it for code generation. We can generate these iteration graphs for any expression; it's a general representation. Here are some examples from linear algebra and tensor algebra. This goes from matrix-vector multiplication all the way down. This is the MTTKRP kernel that we use for factorizing tensors. This is a blocked matrix-vector multiplication, which turns into a four-tensor operation; in our representation, a blocked matrix is just a four-tensor. A doubly blocked matrix is a six-tensor, and so forth.
So let's look at this expression again, and now we're going to look at code generation. We have the iteration graph, generated from the formats and the expression, and we're going to generate code. If the first dimension of B is dense, then it's very simple: we can generate a dense loop. This gets back to what I was saying a little earlier, which is that you can generate code for each level separately. At this level, I can ignore everything down here. You have a recursive code generation algorithm that just recurses down and generates code for each level. The first level is dense, so you just emit a dense loop. At the second level, B is sparse, so you're going to emit a sparse loop. The complexity comes in when you have a merge: you're merging a sparse B with the sparse c. Now you have to emit merge code, and I'll talk about how to do that next.
And then at the center, you have to emit the compute statement. This merge is what we call a conjunctive merge; it's an intersection. You want to produce values in the output wherever both B, in the third dimension, and c have a value, because multiplication by zero disappears.
>> If you have the sparse [inaudible] at any dimension, do you sort them in order?
>> No-
>> Like the indices. Because that seems necessary for all of this computation.
>> Alright, good question. Yes, we do in this work. In this work, we have the dense format and the sparse format per dimension, and the sparse format is, like CSR, ordered, because you cannot do this kind of merge without ordering. But we're working on new work where, instead of two formats per dimension, we'll have hundreds of formats per dimension, and there you can select the properties you want. If you have ordering, you can do this. If you don't have ordering, you have to resort to sorting on the fly, or do something more similar to a hash join, or something like that. But that is future work; we're working on it right now, and we've mostly sorted it out. Here, it's ordered.
But if this vector is dense, we take advantage of that: we iterate over the sparse format and just random-access the dense vector, which looks a lot like a hash join from databases. So we don't need a merge there. So that's the conjunctive merge. I'm going to look at the conjunctive merge in more detail and show how we can generate code for it.
Then I'm going to show how to generate code for the disjunctive merge, which comes from an addition. And then I'm going to show some examples of how we can do that for any expression, any combination of additions and multiplications.
So this is just an element-wise multiplication of two vectors. Since we do this per dimension, we can consider vectors; by solving the merge problem for vectors, we have solved it for the whole thing, because we just apply it recursively. So here are two vectors. You element-wise multiply them, and you produce a result wherever both have a non-zero. Where only one has a non-zero, there is no result.
If a vector is sparse, we enumerate its indices and remove the zeros, and since this is no longer the full range, you have to store the indices. So we store them. Then we want to take the intersection of these indices and those, and the way we think about that is using this structure: this is one point in our merge lattice, which is an ordered lattice.
This point is saying two things. It's saying: iterate over both b and c until one of them runs out of values. And it's saying: at each point, produce a value if both b and c have a value at that point. When one of them runs out of values, you drop down to the bottom lattice point, and you're done. So if b runs out of values, you're done, and if c runs out of values, you're also done, because it's a conjunctive merge.
So we can generate code for this lattice point; it's the only lattice point in this case. We generate a loop that runs while both vectors have values left, and inside it, a case saying: if both have a value at that point, emit the value.
Let's look at how that works on this example. We start by finding the positions into b and c and initializing them here. Then we check whether either of them has run out of values; they still have values here. Then you take the minimum of the two current coordinates. If both have a value at that coordinate, you produce an output value. Then you increment the positions you consumed, and you keep going until one of them runs out of values. At that point, the test fails and you're done.
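In code, that intersection loop looks something like this minimal sketch (assumed pos/idx/vals array names; not TACO's exact output):

```c
/* Conjunctive merge sketch for a(i) = b(i) * c(i), both vectors sparse. */
int pb = b_pos[0], pc = c_pos[0];
while (pb < b_pos[1] && pc < c_pos[1]) {  /* while both have values left */
  int ib = b_idx[pb];
  int ic = c_idx[pc];
  int i  = ib < ic ? ib : ic;             /* minimum candidate coordinate */
  if (ib == i && ic == i) {               /* both have a value at i */
    a[i] = b_vals[pb] * c_vals[pc];
  }
  if (ib == i) pb++;                      /* advance whatever was consumed */
  if (ic == i) pc++;
}
```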
And that's it. Now for the disjunctive merge; that's an addition. Instead of taking the intersection, we have to take the union, and then it gets a little more involved. Let's look at the lattice. We could create a lattice point that iterates while either of them still has values left, and then, if one of them has a value, add that value to the output, because a + 0 is a.
But it is too expensive to emit a loop like that, so we rewrite that expression into this form, which is a sequence of conjunctions. This is the two-way merge algorithm that Knuth describes. We put these three points into this lattice structure. What it is saying is: run while both of them have values left; when one of them runs out of values, run over the rest of the other one and add it to the output.
To produce code for this, you emit one loop per lattice point. The first loop runs while both have values; it runs down here the same as in the previous example, and then we run out of values in c. So we drop down to the b lattice point, and you have to run over all the rest of the values of b, so we emit a loop for that.
>> Yes. So, what about the order?
>> The order?
>> Like the loop. You're disrespecting the order of the indices, is that right? Because if you do the intersection first, then whatever is in the intersection comes first in your output, right?
>> Oh, no. This is running down here, and we're going to emit cases so that we can do all of that in the middle here. So you run down like this until here.
>> You're going to patch up,
that's why you have
the blank space.
You're going to patch it
up with the one's with
the only have one.
>> Right. It's going
to be three cases here.
One case and they both
have value, you add up.
>> That's an empty value.
>> Yeah. And those cases are
also going to be generated
from this Lattice,
because still, it's
long case Perloff is part of
the Stahmann method by the
Lattice part you're currently.
I met in called for. And since
these are dominant
by only them self ,
there's no cases because
it's trivially just one case.
>> So, you called the lattice points b and c; is this different? This is a disjunctive merge rather than a conjunctive one.
>> For this one?
>> Yes.
>> Yes, exactly.
>> It looked like the previous slide, if you ignored the bottom there. It would seem the same, but now it's actually different.
>> The difference comes from... now I'm making a mess. The difference here is that down there you have only one case, because those points don't dominate anything else.
>> Just context, that's all.
>> It is, absolutely. Okay, I am making a mess; I think I'll sort it out myself.
So you are at the b loop, and that b loop will iterate until b runs out of values. And you could also have had the case where c had more values, so you have to emit the loop for c at this point too. And now we're getting to the cases: there's going to be one case in each loop for every lattice point dominated by that loop's lattice point. Since the top point dominates everything, there's going to be a case for everything: the first case is if both have a value, you add them; the second case is if only b has a value, you store it; the third case is if only c has a value, you store that. Did that answer the question?
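Putting the loops and cases together, a hedged sketch of the emitted union code (same assumed array naming as before) is:

```c
/* Disjunctive merge sketch for a(i) = b(i) + c(i), both vectors sparse:
   one loop per lattice point, one case per dominated lattice point. */
int pb = b_pos[0], pc = c_pos[0];
while (pb < b_pos[1] && pc < c_pos[1]) {  /* lattice point: b and c */
  int ib = b_idx[pb];
  int ic = c_idx[pc];
  int i  = ib < ic ? ib : ic;
  if      (ib == i && ic == i) a[i] = b_vals[pb] + c_vals[pc];
  else if (ib == i)            a[i] = b_vals[pb];
  else                         a[i] = c_vals[pc];
  if (ib == i) pb++;
  if (ic == i) pc++;
}
while (pb < b_pos[1]) { a[b_idx[pb]] = b_vals[pb]; pb++; }  /* rest of b */
while (pc < c_pos[1]) { a[c_idx[pc]] = c_vals[pc]; pc++; }  /* rest of c */
```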
OK. So those were the disjunctive and conjunctive merges. As I said, we can do this for any expression, and I'm going to show you some examples without going into the details. Here, I have an addition that's multiplied by a vector. The lattice looks like this, and in our upcoming OOPSLA paper this year we have the algorithm for producing lattices from any expression. As before, you just generate one loop per lattice point, and inside each loop, you generate cases for each lattice point it dominates.
Here, you're adding three vectors together, so you get this rather large lattice structure. But as before, you just generate code for each lattice point. This, by the way, applies to databases too, because you can use it to generate code for arbitrary joins.
Then, what if I make that d dense? Now you're going to get a mix of merge code and random access. That turns out to be a simplification of the merge lattice where you knock out some of the points: since d is dense, it covers the whole space and never runs out, so some of these loops are knocked out, and the emitted loops simplify a little bit. We support any such combination, where the operands can be dense or sparse.
OK. So that was the-
>> What about negation, where zero becomes one and one becomes zero, so you-
>> Like a negation? You mean-
>> You have to generate an output value wherever the input value is zero.
>> Oh yeah, we haven't done that. For most sparse operations, that would be a problem that would kill you pretty quickly.
>> We're having-
>> Yeah. We haven't considered that case.
>> Those will show up, I suspect.
>> I see. Yeah, we haven't written the paper applying this to databases.
>> Yeah, but in machine learning there are a lot of Boolean negations.
>> So, Boolean negation. That's interesting, because that's moving this to a different semiring than the normal semiring. I've started thinking about that a little bit, but we haven't talked about it a lot. That would be very cool. I'd have to think offline about how that would be done, but I'm sure we could figure out a way to generate that kind of code. Maybe what you do is generate all the values, then run over all the non-zeros and knock them out.
Are there any questions on the compiler part? OK. Then we are going to look at our evaluation and the results; it's going to be pretty fast. But first, I'm going to show you guys a demo of TACO.
>> Are you tiling these loops?
>> Tiling. There are two types of tiling: you can tile the data space, and you can tile the iteration space, and you can use our formats to tile the data space. You can do that. Like, I can turn a sparse-sparse matrix into a sparse-sparse-sparse-sparse four-tensor, and tile it that way.
But tiling the iteration space when the loops are sparse doesn't buy you much, because you need conditionals to figure out whether you've hit the middle of a row. For tiling dense iteration spaces, it makes a lot of sense, because you have random access: you can just run up to the middle and jump down. But we haven't done that work yet.
So, as I said, our results show that we are very fast for sparse stuff. For dense stuff, we don't tile, so we're not going to be as fast as something like the Tensor Contraction Engine from Ohio State. That's future work; we're going to look at it next semester, this kind of tiling of iteration spaces, so that we can be fast on dense operations as well. It seems I don't have access to the Internet.
>> Or you can just use mine.
>> It's okay, I can just do it. We have a web page where you can generate code, and a command-line tool that can do it. So I can, for instance, type any expression. Actually, I think I'm going to skip this. But you can type any expression, you can choose the formats, and it will generate the code for you. On our web page, you can do it in a GUI where you choose the formats with drop-downs. It's kind of cute, but I think it'll take too long to set up.
You've got to have connectivity-
>> Do people use that for generating code already, or is it just for demos?
>> It's for demos, but it will generate a C library with that kernel for you. So you can use it for generating code and then copy it into your application. It generates a header file with the-
>> Yeah. But we also have a C++ library that you can use just like a linear algebra library; it's a tensor algebra library. You get overloaded C++ operators, so that you can just define your expression and compute.
>> So, how does the expression get compiled in that case?
>> What we do right now is take the expression, run it through this thing, which generates C code, store the C code, call a system compiler to compile it into a dynamic library, and dlopen that library. But eventually we want to do that in memory. It compiles on the fly, yes. So, it requires some compiler infrastructure at runtime.
If you don't have compiler infrastructure, if you're running on a supercomputer node, then you can use the offline tool to generate the kernels you need.
>> [inaudible].
>> Sure. I'll go on to the evaluation.
There are two main stories in the results. The first is that there is an infinite space of tensor algebra kernels. Some points in this space have been hand-optimized by someone; others have not. I'm going to show that for the ones that have been hand-optimized, we're as fast as the hand-optimized code, and we get the same performance for everything else. So, wherever people haven't hand-optimized, you have two alternatives: you can use this, or you can use an alternative system that's much slower, one of the few general systems like the MATLAB Tensor Toolbox; or you can hand-optimize it yourself and get the same performance you get from our system.
The second story is that there's no one-size-fits-all format. Some formats are good for some matrices and tensors, and other formats are good for others. So you have to have a system that can give you many formats, so that you can shape your formats to your data.
So let's look at one of the very simple expressions. The reason I show results for sparse matrix-vector multiplication first is that this is the kernel people have spent the most effort hand-optimizing, so the implementations in libraries like MKL are very fast. In sparse matrix-vector multiplication, you multiply a sparse matrix by a dense vector to produce a dense output; you take the dot product of each sparse row with the dense vector.
The results are here: this is TACO, MKL, OSKI, which was built for auto-tuning, Eigen, uBLAS, and this one, which I believe is part of Boost. The results show that we have comparable performance. This is normalized time, so lower is better. Sometimes we are a bit slower than MKL, which is a really good library; sometimes we are a bit faster. So, it's about the same.
>> Sir, those on the y-axis, are those different inputs?
>> The matrices are from the [inaudible] sparse matrix collection. Forgive me, I should have said that first; it's normalized time.
>> Is it just a reflection of the fact that there isn't really a lot to do here performance-wise? It's not going to be fast, it's not random access, so really, what you're gaining is that you can handle anything, and you get a nice regular format for [inaudible].
>> Yeah, for this operation, it's likely that you write about ten lines of code and that's the best you can do; software prefetching is not buying you a lot. If it were buying you a lot, we could generate software prefetches too. But it's not, because I think you're bound by the length of the store buffer. You do have random access into the vector, which means you don't have to emit merge code. So you are locked up in that [inaudible].
>> So in the cases where [inaudible] does do better, [inaudible] to see why it is, it should be pretty easy to inspect the code, right?
>> Yeah. Well, you don't have the code from MKL.
>> Oh, okay.
>> It's proprietary. We think it's because these are small matrices, so they fit in cache, and then-
>> And they do a better job dealing with that?
>> Apparently, yeah, and we do a better job when it doesn't fit in cache.
>> Okay.
>> But that is a hypothesis; I haven't gone and verified it.
>> So, you don't examine the possible loop orders and choose the most optimal one? There's no search through [inaudible].
>> Given one expression and one format per matrix, you have no choice; you have to follow the format in the direction the format goes.
>> But when you have, some of them are banded.
>> Then you could tile the iteration space. We don't support that at the moment, so we can't generate code like that. So if it's all dense operations, we are not the fastest one out there. You can tile the data space, though; if you do that, then you can get good performance with this.
>> Why are these libraries different? So where is the difference coming from?
>> They're not that different; this is almost within the noise. I think Eigen is a little slower because they use some abstractions, like iterators and template metaprogramming, that might not produce the best code. This is normalized time; this is one, so maybe this is five, ten percent.
>> Is it within your measurement noise?
>> It could be, yes. I don't know the answer to that, but it could be.
>> So this is a particular sparse representation?
>> Yeah, this is CSR.
>> Okay. So depending on the matrix, you'd actually get quite different results too, right? [inaudible]
>> Right. Right. I'll show that later. If you have different matrices, you have to choose different representations for the matrix; you can't use CSR for every matrix. For some matrices you have to use a different format, and the choice of format will guide how the code works. So then you will actually search over different codes and find the best one. We can emit parallel code too, and the parallel code is much faster than the non-parallel code, of course. MKL also has parallel code. So again, we're comparable; here we're actually maybe a little faster.
>> Shared memory?
>> Shared-memory parallel, on a single socket, because of the way we initialize memory right now. We can't use multiple sockets because the first-touch memory placement policy would kill us. We can solve that easily; we just haven't done it. So it's shared memory. [inaudible]
>> So [inaudible] comes from, if in a particular row there are three non-zero indices, you can just look up the dense vector in a [inaudible].
>> Yeah. The way we parallelize: we parallelize the outer loop; this is CSR. This row is one thread, that row is another thread, and they gather from the dense vector in parallel. Each thread produces its own outputs, so there's no interference on the output. If this were CSC format, you would have to synchronize on the output, because you're scattering into the output.
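A hedged sketch of that row-parallel CSR kernel (assumed array names; an OpenMP pragma stands in for whatever parallel runtime is used):

```c
/* Row-parallel CSR SpMV sketch: y = A * x. Rows are independent, so each
   thread writes disjoint entries of y and needs no synchronization. */
#pragma omp parallel for
for (int i = 0; i < num_rows; i++) {
  double sum = 0.0;
  for (int p = A_pos[i]; p < A_pos[i + 1]; p++) {
    sum += A_vals[p] * x[A_idx[p]];  /* gather from the dense vector x */
  }
  y[i] = sum;
}
```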
>> But what if there's load imbalance, like a lot more non-zeros in the first half than the second half? There is no dynamic work distribution?
>> We could generate code that uses Cilk, and that would solve that problem, but we don't do anything for it now. Most of the optimizations for SpMV have focused on how you reorder the data so that the access pattern is better, and you can do that with this approach too; you just do it separately, outside. Reordering the data is sort of orthogonal to code generation.
>> What is the sparsity of those data sets, and how does this compare with dense performance?
>> Right. These are very sparse. I can't speak for all of them, but I think most of them have a constant number of non-zeros per row, so the more rows there are, the sparser the matrix gets: the number of non-zeros grows with the size of [inaudible] rather than with its square. That's very common because of locality in the world.
>> I see. So do you have experiments that say, if you have 90 percent sparsity, how you compare with dense performance?
>> I didn't put that slide in this talk; I should have had a backup slide on it. Forgive me, but I'm bringing up the paper here; we have that graph. That's a couple of points.
>> I cannot see it so clearly.
>> What this is saying is that for SpMV, sparse matrix-vector multiplication, if 66 percent of your values are non-zero, you are at this end; if you have more than 34 percent zeros, you're better off with a sparse format. This is a surprising result; it does not match the result from the OSKI PhD thesis. But this is what we have found, and we're pretty confident in that result.
>> That is good to know because, at least in the domain we work on, we see about 80, 90 percent sparsity; it's not super sparse, but we've been thinking about it. [inaudible] From that result, it looks promising. Thank you.
>> Then probably that comes from vectorization; that's why that 34 percent makes sense.
>> It could be. Though vectorization matters less because memory bandwidth is the bottleneck now. That could be.
So, I'm going to look at two more operations. This is another matrix operation; forgive me, I'll do another linear algebra one, because it shows something very interesting. The next one will be a tensor one. This is sampled dense-dense matrix multiplication. It comes from data analytics; it's taken from a PhD thesis at Berkeley. You multiply two dense matrices, and then you element-wise multiply the result by a sparse matrix. So, you have a dense matrix times a dense matrix; that gives you 64 inner products here. But the result is element-wise multiplied by a sparse matrix, which means you only have to do a dot product wherever the sparse matrix has a non-zero. So, if I have a non-zero there, I have to do that dot product. But there are only ten of these, so that brings me down to ten inner products.
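In index notation, sampled dense-dense matrix multiplication is (with B the sparse sampling matrix and C, D the dense factors; the letters are my naming):

```latex
\[
  A_{ij} = B_{ij} \sum_{k} C_{ik} \, D_{kj}
\]
```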
So here, you have to implement this whole thing in one kernel. If you implement it as two binary kernels and compose them, you lose a lot of performance. This shows a general thing that happens outside of linear algebra too: when you compose high-performance libraries, the first library might do work that the next library throws away. That's one problem. The next issue is that if you compose libraries without this kind of approach, you lose temporal locality. And the third problem is that you might emit data from one library and then have to reorganize it to go into the next library, because it expects a different layout. So, they don't really compose well.
The problem here, as I was telling [inaudible] earlier, is that we are hand-writing these specific routines on fixed layouts, and then composing them by changing the data. But we believe that you need to shape the code to the data, not adapt the data to the code. You generate the code for the data you have. Then, when you compose libraries, you generate code to use that format, and the next library generates code to use the same format, so the composition is friction-free. This is a model example of why you need that.
The performance here shows Eigen, which does this operation in one kernel, so we have similar performance; again, a little noise back and forth. uBLAS does it as two binary operations, and the performance dies, which is expected. That performance gap is going to grow as the sparsity of B increases, so there's an asymptotic difference: this is O(nnz) instead of O(n squared).
Then you have MTTKRP, which comes from tensor factorization. In this operation, you have three index variables indexing the three-tensor. You sum one of them against one matrix, and another against another matrix. This is the main kernel in the canonical polyadic factorization of tensors.
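Written out, the MTTKRP kernel (in its standard form; the letters are my naming) is:

```latex
\[
  A_{ij} = \sum_{k,\,l} B_{ikl} \, C_{kj} \, D_{lj}
\]
```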
This is a sparse operation, and you have to do it in one kernel; otherwise, the temporaries will kill you. So you do it in one kernel, and we generate code to do that. The MATLAB Tensor Toolbox has a sort of interpreter approach where they take the tensor, transpose it so that it looks like a flattened matrix, call matrix libraries, and then transpose it back; they shape the data to fit the routines they have available. But performance is quite lacking with that approach for this operation. If you generate exactly the right code, you get good performance.
And SPLATT is recent work by Shaden Smith at the University of Minnesota; he hand-optimized this kernel, and he gets excellent performance. We generate code that gets comparable performance: a little worse in most cases, a little better there. We know about these differences, and I'll mention on my last slide how we're going to fix them. So, what this shows is that you have to generate code for this whole kernel in one go.
The final evaluation shows that different formats matter for different matrices. Here, you have a dense matrix, and here's the SpMV performance; all the formats are row-major, and this is matrix-vector multiplication. The dense format is, of course, the best format for a dense matrix. But if the matrix is a thermal-simulation matrix that looks like that, you want the dense-sparse format, which is CSR, because you have something on each row, but most rows don't have many values. So you want to compress the second dimension but not both dimensions; otherwise you're a little worse.
If it's a row-sliced matrix, a sliced dense matrix, you want the sparse-dense format. And if it's a hypersparse matrix, which we're starting to see in data analytics, you want the sparse-sparse format. So there's no one size fits all; different formats matter for different matrices. You have to conform the format to the data, and when you do, you have to generate the right code.
Finally, this is a blocked FEM matrix. Here you have 3-by-3 blocks, because you're simulating in three dimensions with these kinds of matrices. For us this is a four-tensor, so we can generate code for blocked linear algebra; it becomes a four-tensor operation. This is SpMV as a blocked tensor operation. Here, you store it as dense-sparse-dense-dense; that is the blocked compressed sparse row format, and it is the best format. If you don't keep the inner block dimensions dense, then you lose a lot of performance, because you don't get to vectorize those inner dense loops. So, that is what I was going to talk about.
>> So, let's say we have a very complicated tensor expression; do you ever have to use a transpose in it? Say one expression results in ijk, and another expression results in ikj.
>> A transpose would show up as a cycle in the iteration graph, which we don't support. So you have to transpose first: an expression with a transpose, you have to transpose before and then feed it into the expression. Because with a transpose, you iterate in the order of one operand's format, but you have to access the other operand in a different order. So you'd have to do random access, but through several steps.
>> So, assuming there is a total order on all the indices in your expression, do these kinds of expressions show up in practice?
>> Transpose expressions? Absolutely. I showed some in the beginning, with PETSc, which transposes. That expression we can do, because you're multiplying by a vector, so it's okay; you're not producing a transposed sparse matrix. But if you're doing a plain transpose of a sparse matrix, we've hand-coded that; our theory cannot generate code for it. That's future work too; I didn't put it on here, but it's one of the future works.
We want portability, so we want to do GPUs. We've looked at a super TACO, a supercomputing version. Many more formats; that gives us optimizations that change depending on the formats we pick. This will also give us good things like the right matrix-matrix multiplication algorithm for sparse matrices; I'm working on that now, and it's almost done. We want a scheduling language for the tiling. We want different semirings, and then you can do a lot of graph algorithms, because we give you the code generation mechanism; the next question is what code you want to generate. So, I think we are out of time, right?
>> You can have five minutes.
>> I was going to talk about workspaces, but I will skip it.
>> One more question: there are these TPUs now, where the tensor operators are implemented in hardware. How does this affect this? Is it that there's another layer that can compile your output to that; is that basically the way to think about it?
>> Yes, I think that's right. For the TPUs, that would be a different backend for us, so we can compile to a TPU. The Google TPU looks like it does dense matrix-matrix multiplication, so we would have to generate little dense matrix multiplications that it could use. For the Nvidia stuff, we are writing a grant proposal to Nvidia now to use this as the compiler for their custom tensor units. But what we are trying to propose is that TPUs shouldn't be as hard-coded as Google's TPU. Because when you have something like this, you can build a TPU that does certain operations really fast but can still execute general code; you don't have to hard-wire the exact operation. You want more programmable TPUs, because then you can use this better.
>> Yes, and so there's a direct relationship between the representation you have and the underlying hardware support at some level, but you haven't explored that directly in your work.
>> That's right. But they have to match as well; otherwise you get friction. You have to shape everything to fit, just like a hand in a glove.
>> Yes, as I said, it seems like a natural direction to investigate, certainly because at some point these things are rolling out in hardware, and hardware is hard to change at some point. I think you want to get it right the first time.
>> I think that makes the case for a compiler even stronger, because you don't want to hand-write code for all these kernels; like I showed you, there's an exponential explosion, and different platforms are another axis of explosion. So a compiler would be even more useful; but it's future work to actually do it.
>> A little bit [inaudible]. I mean, do these devices, or even GPUs, help with sparse multiplication?
>> Oh yeah. There's work from Michael Garland's group at Nvidia showing that GPUs are quite good at sparse matrix-vector multiplication in general, and even sparse matrix-matrix, because the GPU has more floating-point units but also about ten times the bandwidth. So you can load values from memory really quickly, and you can hide latency with their thread-switching mechanism.
>> How sparse are the operations in typical machine learning computations? Deep neural nets are obviously mainly dense.
>> A lot of them are expressed as dense operations, but there's been work, especially from Bill Dally's group at Stanford, looking at sparse neural networks. Typically, in a fully connected layer, you have a weight matrix that you multiply by the input, and during learning you can knock out values in that weight matrix without affecting the accuracy much. And then you create sparsity there.
The way I like to think about it is by analogy to neuroanatomy: the brain is a sparse system, you just don't know the best sparsity pattern, and you're trying to find that sparse system by clamping some non-zeros to zero. But you have to create it artificially during learning.
>> But none of that sparsity is actually dynamic; like, you do have a [inaudible]. In practice, around 50 to 60 percent of the values are likely to be zero, but you don't know which 60 percent. It changes; so would you be able to make use of that? Because you'd have to convert it into a sparse representation.
>> Yes, that's a really good question. What we are finding in our new formats work, which we are doing in my group now, is that for a situation like that, packing into something like CSR is too expensive, but coordinates are the natural format: you just list all the non-zeros, you don't have to pack, so it's friction-free. Coordinates might make the computation a little slower, but since you don't have any packing cost, you would be able to take advantage of things like that, depending on how sparse it is. Since there's no friction in packing, it doesn't really matter that the values change.
>> Something to add is that, for training, there are different ways to get [inaudible], whereas for inference that part is static; there you can enforce the sparsity.
>> So maybe for inference, if it's static, you can use something like CSR that's more compressed, and then for training you can use coordinates, which are a little less compressed, because you don't compress the coordinates: the values are packed, but the coordinate tree is not compressed. You take a little bit of a hit on that, but you don't pay the packing cost.
I guess the last thing I want to say is that all of this is available on GitHub under the MIT license; you can use the library, or you can use the web interface that lets you generate code for your expressions.
>> So why does it say "some rights reserved" there? What's the story with that?
>> You have to give acknowledgment; that's at the bottom of the MIT license.
>> It's a standard MIT license?
>> It's the standard one, just like Wikipedia's. It's just that you have to acknowledge us; that's it. It's not a custom MIT license, just the standard one. Yeah. So on that page you can generate kernels and play with formats and all of that.
>> Are you aware of companies that are using this?
>> This thing? I've been talking to TileDB, which is a startup doing databases for sparse scientific data, like biogenetic data, and they're very interested in building the compute part on this and generating code with it. And I'm talking to some people from the government who also want to use it for things.
>> And all the code is in a [inaudible], basically?
>> Everything except this JavaScript, but if you wanted the JavaScript, we could give it to you; it just calls the server that compiles it. Everything is there; we don't keep anything back.
>> How complex can your kernels be? Or is it just the basic operators that you fuse into a kernel?
>> The answer is going to have two parts. You can have any size of kernel, but if you have a lot of additions, the merge lattices will grow exponentially, doubly exponentially I believe, which means if you add, say, five sparse matrices, you will get 100 megabytes of code. We'll happily generate that for you, but you should not ask for it. We give you a lot of power, but with power comes responsibility.
So for very complicated chains of additions, you don't want to blow it up; you want to cut the expression into multiple expressions. One could imagine a system built on top of this that makes those policy decisions for you; that would be very interesting research. For multiplications, you don't get that problem, and if you have mixes, some additions and some multiplications, the multiplications will cut down the size of the merge lattices.
>> What is the compilation time?
>> Milliseconds. You can see it right here, right?
>> Is that real time?
>> Yeah. This goes like that, and that's going to a server that calls my command-line tool. Of course, if you have a really complicated addition, it may produce 100 megabytes of code, and it'll take a while to write that out.
>> Could you confuse it, like just adding a lot and multiplying a lot?
>> I'm sure you can crash it. We have it in a virtual machine; I don't think we've explored that, but we will soon.
>> So is the compilation time linear in the size of, for example, one of the graphs that you had, the iteration graph?
>> It's linear in the size of the graph, but complicated lattices will cause it to grow: really complicated lattices will produce a lot of code, and it's linear in the number of lattice points. It is a recursive algorithm, so if you have a merge lattice at one level, you recurse down for each lattice point, because it's merging in that dimension with lots of subcases. But it's linear in that size, so compilation time is not a problem.
