CPUs execute machine-language instructions,
but what can you do quickly with just one
instruction?
Some 40 years ago,
you were maybe able to calculate the sum of
two small integers.
Some 20 years ago,
you were able to do more complicated operations,
such as the multiplication of floating-point
numbers.
But nowadays you can do a lot more.
You can take entire vectors of floating-point
numbers,
and do, for example, elementwise multiplication
for them.
It turns out that modern CPUs are, in essence,
vector processors.
Even if your code is only doing individual
floating-point operations,
compilers will still generate code that instructs
the CPU to use its vector registers to do
calculations.
If you look at CPUs in typical desktop computers,
you will see they have got two different kinds
of registers:
registers that can hold 64-bit integers
and registers that can hold 256-bit vectors.
If you compile your normal C++ code,
you will see that the compiler will typically
use
integer registers to hold pointers, indexes,
counters
and to do plain old integer arithmetic,
while vector registers are used for doing
floating-point arithmetic.
But we can do a lot more with the vector registers
if we want!
A single-precision floating-point number is
32 bits long,
and a double-precision floating-point number
is 64 bits long.
So one 256-bit vector register is large enough
to hold,
for example, 4 double-precision floating-point
numbers
or 8 single-precision floating-point numbers.
But you can also use the large capacity of vector
registers
in many other creative ways.
You can store, for example, 32 bytes of data
in one such register.
Here is a piece of text that is 31 characters
long,
so you can actually store the entire sentence
in one vector register!
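To make the "32 bytes in one register" idea concrete, here is a minimal sketch using the GCC/Clang vector extension. The type name char32v_t and the helper pack_text are hypothetical names made up for this illustration; they are not from the course material.

```cpp
#include <cstring>

// Assumption: GCC/Clang vector extension; char32v_t is a hypothetical name
// for a vector of 32 chars, which fits in one 256-bit register.
typedef char char32v_t __attribute__ ((vector_size (32)));

// Copy at most 31 characters of text into one 32-byte vector.
// All bytes after the text stay zero.
char32v_t pack_text(const char* text) {
    char32v_t v = {};                    // start with 32 zero bytes
    std::size_t n = std::strlen(text);
    if (n > 31) n = 31;                  // keep at least one zero byte at the end
    std::memcpy(&v, text, n);
    return v;
}
```

Whether the value really lives in a register is up to the compiler; the sketch only shows that the data fits.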
CPUs can do all kinds of operations with their
vector registers,
but a very typical example is elementwise
arithmetic operations.
For example, elementwise addition of two vectors.
This instruction, vaddps, tells the CPU that
registers ymm0 and ymm1 contain vectors of
8 single-precision floating-point numbers.
This is the "ps" part, "packed singles".
And it asks the CPU to do elementwise addition
of these vectors
and to store the result in register ymm2.
So the instruction works a lot like taking
two arrays
and adding all pairs of elements.
We are doing 8 floating-point additions here,
in parallel, with just one instruction.
And it is fast.
As fast as doing just one floating-point addition!
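As a mental model, the effect of this one instruction corresponds to the following scalar loop. This is only an illustration of what is computed, with a hypothetical function name add8; the instruction itself performs all eight additions at once.

```cpp
// Scalar model of an 8-wide elementwise addition (illustration only).
// a and b play the role of the two input registers, c the result register.
void add8(const float* a, const float* b, float* c) {
    for (int i = 0; i < 8; ++i) {
        c[i] = a[i] + b[i];
    }
}
```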
But how do we write C++ code that makes use
of
these highly efficient machine-language instructions?
There is always a hard way.
If you want, you can make your code completely
unreadable
by using so-called intrinsic functions.
But there is fortunately also an easy way.
Your compiler can help you.
You can define so-called vector types and
then
if, for instance, x and y are defined to be
vectors,
you can just write x + y in your code
and you will get a vector addition.
The compiler will generate the right
machine-language instruction for you.
Unfortunately, the syntax for defining a vector
type in GCC is pretty ugly.
But you don't really need to remember it or
see it that often, either.
Just copy-paste this fragment to the beginning
of your program
and from then on you can just use this type
float8_t
whenever you want to define variables that
are vectors.
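The fragment itself is not reproduced in this transcript. With GCC's vector extensions, a definition along the following lines should do the job; treat the exact spelling as an assumption rather than a copy of the original slide.

```cpp
// Assumption: GCC (or Clang) vector extension.
// float8_t is a vector of 8 floats, 32 bytes in total,
// which matches one 256-bit vector register.
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

static_assert(sizeof(float8_t) == 32, "float8_t should be 256 bits wide");
```

To let the compiler actually use 256-bit instructions, you typically also need a compiler flag that enables them, for example -mavx2 or -march=native on x86-64.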
And now it is easy.
You can think of float8_t as a type that behaves
a lot like an array of 8 floating-point numbers.
But you can also write things like "a + b"
and it will do elementwise addition for these
two arrays.
And the key point is that whenever you use
vector types,
the compiler will generate very efficient
machine code.
This "a + b" here will be translated into
just one machine-language instruction, and
this instruction will do all 8 additions simultaneously
in parallel!
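A minimal sketch of this, assuming the float8_t definition based on GCC's vector extensions; the function name add is just for illustration.

```cpp
// Assumption: GCC/Clang vector extension, as in the float8_t definition.
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// Elementwise addition of two vectors of 8 floats.
// With AVX enabled, compilers typically turn "a + b" into a single vaddps.
float8_t add(float8_t a, float8_t b) {
    return a + b;
}
```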
In general, you can pretty freely write code
that uses vector types,
and you can expect the compiler to do the
right thing.
Addition and multiplication will result in
elementwise operations.
You can mix scalars and vectors,
and you will get what you would expect.
You can also refer to individual elements
of the vector,
as if it were just a normal array of 8 elements.
But please note that as soon as you start
to refer to individual elements,
you will no longer benefit from efficient
vector operations.
To do lots of work in parallel,
you really need to do operations with entire
vectors.
Maybe refer to individual elements in preprocessing
and postprocessing,
but make sure the critical inner loops do
as much as possible
with complete vectors.
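The difference between the two styles can be sketched as follows. The function names scale_shift and sum8 are hypothetical; the first one works on whole vectors and mixes in scalars, the second one touches individual elements and is the kind of thing you would keep for postprocessing.

```cpp
// Assumption: GCC/Clang vector extension, as in the float8_t definition.
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// Whole-vector work: the scalars s and t are applied to all 8 elements.
float8_t scale_shift(float8_t x, float s, float t) {
    return s * x + t;
}

// Element-by-element work: fine for postprocessing,
// but this does not benefit from vector operations in the same way.
float sum8(float8_t x) {
    float total = 0.0f;
    for (int i = 0; i < 8; ++i) {
        total += x[i];
    }
    return total;
}
```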
You can imagine that float8_t is a small class
that contains 8 floats
and has some convenient overloaded arithmetic
operators.
You can pass these freely around in the code,
and the compiler will do the right thing.
For instance, this piece of code works
just fine.
You can pass vectors as parameters to functions.
You can define local variables of vector types.
You can even define small constant-size arrays
of vectors.
Your function can return vectors.
And the compiler will not only compile it
correctly,
but it will generate efficient code, something
like this,
with just 4 machine-language instructions.
GCC even managed to keep everything in registers.
Here parameters "a" and "b" are passed in
registers %ymm0 and %ymm1.
Vector addition "a + b" is translated to one
instruction.
Vector subtraction "a - b" is translated to
one instruction.
Vector multiplication is translated to one
instruction.
And finally, the result is returned in register
%ymm0.
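The code on the slide is not part of this transcript, so the following is only a sketch of what such a function could look like. The name sum_times_diff and the way the results are combined are assumptions; it uses vector parameters, a small local array of vectors, and a vector return value.

```cpp
// Assumption: GCC/Clang vector extension, as in the float8_t definition.
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// Computes (a + b) * (a - b) elementwise.
float8_t sum_times_diff(float8_t a, float8_t b) {
    float8_t t[2];              // small constant-size array of vectors
    t[0] = a + b;               // one vector addition
    t[1] = a - b;               // one vector subtraction
    float8_t r = t[0] * t[1];   // one vector multiplication
    return r;                   // the vector is returned by value
}
```

With optimizations and AVX enabled, you can inspect the generated assembly (for example with the -S flag) to see whether your compiler keeps everything in registers.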
So vector types are pretty easy to use, and
lead to efficient code.
There is only one complication.
If you store vectors somewhere in memory,
you must take care of proper alignment.
Basically, the memory address of each vector has to be
a multiple of 32 bytes.
If you just reserve some memory with "malloc",
this is not guaranteed!
Sometimes you accidentally get such addresses,
sometimes not.
Your code may sometimes work correctly, sometimes
it may crash.
You must use some memory allocation function
that guarantees correct alignment.
In the course material we have got detailed
examples,
and in the code templates that we have provided
for the exercises
you can find memory allocation functions that
you can use directly.
Please keep in mind that you will need these
whenever you allocate
arrays of vectors from the heap, and only
then.
There is no need to use them for anything
allocated from the stack;
there the compiler will take care of the right
alignment for you.
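The allocation functions from the course templates are not shown here. As a rough sketch of the idea, assuming a POSIX system, an aligned allocation helper could look like this; the name float8_alloc is used here only as a hypothetical example.

```cpp
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <cstddef>
#include <cstdint>
#include <new>

// Assumption: GCC/Clang vector extension, as in the float8_t definition.
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// Allocate an array of n vectors whose address is a multiple of 32.
// The memory must be released with free().
float8_t* float8_alloc(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, sizeof(float8_t), n * sizeof(float8_t)) != 0) {
        throw std::bad_alloc();
    }
    return static_cast<float8_t*>(p);
}
```

A typical use would be to call float8_alloc(n) once, fill the array in a preprocessing step, run the vectorized inner loops on it, and finally call free on the pointer.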
So, now you know what vector operations can
do,
and how to get the compiler to generate them.
But figuring out how to use them to actually
speed up your program
may require plenty of creativity.
And this is something we will discuss in the
next part!
