Nature is governed by four fundamental interactions:
gravity, electromagnetism, the weak interaction,
and the strong interaction.
The one that we’re interested in is the
strong interaction, and the theory behind
it is called Quantum Chromodynamics, or QCD.
With only two ingredients, quarks and gluons,
QCD aims to explain a wide range of phenomena:
from the very-high-energy collisions taking
place at CERN to the way protons and neutrons
are squeezed together to form atomic nuclei
at very low energies.
The problem is that, while in the first case
we can manage to understand the process using
pen and paper, in the second it is very
difficult, since all our mathematical tools
break down in this energy regime.
If we want to use QCD directly to compute,
for example, how the mass of the proton emerges
from the interaction between the quarks and
the gluons, the only way to do it is by using
computers.
Or, to be more precise, supercomputers.
We need to put QCD on a computer.
The way to do it is by using lattice QCD.
It has the word lattice in it because what
we do is discretize space-time into a four-dimensional
grid, placing the quarks on the nodes of this
lattice and the gluons on the links.
But why is it so computationally demanding?
We have to evaluate a path integral or, in
other words, sum over every possible path
that a quark can take from one point of the
lattice to another.
This is done by solving the following linear
system, where D, called the Dirac matrix,
contains all the information about the interaction
between quarks and gluons; x, called the quark
propagator, is what we are trying to find;
and b, called the source, is the right-hand-side
vector, which tells us where we put the quark
on the lattice.
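Written out with these definitions (the equation itself appears on screen rather than in the transcript, so this is a reconstruction), the system is simply:

```latex
D x = b
```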
Naively, one would be tempted to solve this
problem by just inverting D. But the Dirac
matrix is enormous, with most of its elements
equal to zero, and explicitly inverting such
a huge sparse matrix is not feasible.
Therefore, it is crucial that we use a method
designed to handle this type of problem.
One of these methods is the multigrid solver.
The multigrid solver can be thought of as a combination
of different algorithms whose main purpose
is to speed up the convergence of iterative
methods, and the way it achieves this is by
projecting the initial system onto a smaller
one, which should be faster to solve.
Without going into too much detail, the solver
proceeds as follows:
Starting at the original lattice, the initial
guess of the solution and the real one are
very different, so we do a few steps with
an iterative solver: this is called "smoothing"
the solution.
Once it is smooth enough, we project our solution
into a coarser lattice.
Here, we can do the same as the previous step,
or if the new lattice is small enough, we
can try to find the exact solution.
Once we have the coarse solution, we have to
project it back to the original lattice.
Since we now have to "stretch" it, we do
a few more steps with the iterative solver
to make sure we still have the solution of
the linear system.
How can we make this solver faster?
Well, this is where vectorization comes into
play.
When coding, it is very common to see the
following loop:
One would think that the CPU does one operation
of the loop at a time: first it computes c[0],
then c[1], and so on.
But modern CPUs are cleverer than that.
Since CPU clock speeds aren't getting much
faster, what engineers came up with was to
widen the registers, that is, to increase the
number of bits the CPU can process at a time.
Instead of 64-bit registers that hold one
double-precision floating-point variable, we
can have 128-bit (or even wider) registers
and process two or more variables at a time.
There are two ways of vectorizing a code:
an explicit way and an implicit way.
The explicit way requires the use of special
functions called intrinsics, while the implicit
way, which is the one we chose, is to prepare
the loops in such a way that the compiler
knows they can be vectorized.
So, what was my project about?
The original goal was to vectorize the whole
multigrid algorithm so that instead of solving
the linear system for one right-hand-side
vector, it could do it for 2 or more vectors
at the same time.
But after preparing the code, so that we could
put the loops over the right-hand-side vectors
in the right place, we decided that we should
begin by modifying the iterative solver used
in the multigrid method, the generalized minimal
residual method (GMRES), and see if it worked.
Here are the results.
In the upper part of the plot is the time
it takes to get the solution of the linear
system using the normal solver or the vectorized
solver, as we increase the number of right-hand-side
vectors.
And in the lower part, there is the speed-up
we obtain by vectorizing.
As we increase the number of right-hand-side
vectors, we reach a value of 2, which is the
ideal speed-up.
The reason we don’t get it right away is
that the compiler inserts a lot of “if statements”
to make sure that it uses the right amount
of data inside the registers, and these clearly
slow things down.
The next step is to remove this overhead and
then get the whole multigrid solver vectorized.
