Hi.
Welcome back to the lecture series on GPU
architectures and programming.
So, if you recall, we have been discussing
different possible GEMM optimizations.
For example, we started with the basic GEMM.
The first optimization was applying our usual
tiling concepts, i.e. how the idea of tiling
helps in matrix multiplication when done with
shared memory based tiles.
But then we also saw that this leads to a small
amount of compute with respect to the loads,
and that can be further improved.
So, all that we did was we coarsened the threads.
And in that way, we could actually reduce
the ratio of load with respect to the computes.
And to say in other words, we actually were
able to increase the amount of compute per
load, by doing the coarsening optimization.
And then we exploited the wider load instructions
that are available in NVIDIA GPUs: they do
not support vectorized arithmetic operations
like multiply and add, but they do provide
wider load and store instructions.
So we made good use of the float4 kind of
data types for using the wider load instructions
while filling in the tiles.
But that also required us to go through this
complex switching of values, which we have
already explained in the last lecture , for
supporting the wider data types and all that.
But the next thing that comes is something
that we called rectangular tiles.
So we will just do a brief recap on why that
is required.
So in the rectangular tiles part, if you remember,
the K40 GPUs have got 48KB of shared memory,
whereas the tiles that we had defined were
32 x 32 tiles, 2 of them, each element storing
4 bytes of data.
So it was overall only 8KB.
That is the basic reason why you would
want to have a larger, rectangular tile.
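As a quick sanity check of that arithmetic, here is the calculation in Python (the 32 x 32 tiles and the 48KB figure come from the lecture; the rest is simple arithmetic):

```python
# Shared memory used by the earlier square-tile version:
# two 32 x 32 tiles of 4-byte floats.
TILE = 32
BYTES_PER_FLOAT = 4
usage = 2 * TILE * TILE * BYTES_PER_FLOAT
print(usage)               # 8192 bytes = 8 KB

K40_SHARED = 48 * 1024     # 48 KB of shared memory on the K40
print(usage / K40_SHARED)  # only 1/6 of the available shared memory is used
```

So the square tiles leave most of the shared memory idle, which is exactly what the rectangular tiles exploit.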
Now well, what do we do with that?
So let us just browse through this idea of
rectangular tiles once again.
So the idea was pointing to the fact that
since we have a large amount of shared memory,
we can increase the tile size and that is
what we did.
So we increase the tile size from 32 x 32
to tiles of size 64.
So that is what we discussed.
We will increase the rectangular tile size.
And , there was also this optimization that
since we are supporting bigger tiles, why
not store the matrix B in a transpose way.
Because then you have the advantage of memory
access coalescing while loading from the global
memory.
But then we also figured out that leads to
a problem with the shared stores.
For that we needed to flip the indices here,
so that the shared memory stores are done
in a nice way.
So that they can be fetched again properly.
But that also led to a problem: when I am
now loading from the shared memory, the accesses
are all falling in the same bank.
So, in order to remove this bank conflict,
we padded the data here.
So, that is a short summary of how rectangular
tiles really helped us.
So these are the changes.
You do global loads from B where B is pre-transposed.
And then you do shared store through the matrix
Bsub in such a way that the load is optimized.
And the way you optimize the load is you padded
the data here with 2 elements, because in
each bank you have a memory width of 64 bits
for the loads.
So you padded with 2 positions here.
And that gives you a huge advantage over the
baseline implementation.
Now, let us start with the next optimization,
which is what we call 2D register blocking.
So, just remember that in the case of rectangular
tiles, we had to do the transpose and then
store properly into the shared memory while
loading from the global memory, and then we
had to pad the data.
So, these were the 3 things we needed to do
simultaneously.
Now again coming to the register blocking
part, earlier when discussing the optimizations
where we are doing the coarsening, I would
say in 1 dimension because if you remember
while we were doing thread coarsening, we
are computing for multiple data points in
1 dimension.
When we were supporting wider loads again
we were doing a similar coarsening in the
other dimension.
So, the general idea could be that you increase
the work per thread in both the row and column
dimensions.
So, this is what we call as 2D register blocking.
So, essentially we are doing the register
loads in such a way that it is essentially
similar to 2D tiling that we discussed for
the shared memory based optimization for matrix
multiplication.
But this is done at a different level.
You are doing a 2D tiled load of data from
the global memory to the shared memory; let
us just replicate that idea and load data
in a similar way from the shared memory to
the registers.
So that is what we will be calling 2D register
blocking.
It is similar to 2D tiling, but it is done
at a different memory level.
In the earlier case, it was from the global
memory to the shared memory.
In this case, it is from the shared memory
to the registers.
What this optimization really helps with is
reducing the shared memory traffic.
Just like the earlier optimization of standard
tiling reduced the off-chip global memory
traffic, in this case you are using the same
idea between the shared memory and the register
file to reduce the shared memory traffic.
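One way to see the traffic reduction is to count multiply-adds per shared memory load. In the inner loop of the scheme described here, a thread loads WPTM values of A and WPTN values of B from shared memory but performs WPTM x WPTN multiply-adds. The sketch below makes that ratio concrete; the WPTM and WPTN values are illustrative assumptions, not from the lecture:

```python
# Compute-to-shared-load ratio per thread, per inner-loop iteration.
# WPTM / WPTN: work per thread in the M / N dimensions (illustrative values).
def fmas_per_shared_load(WPTM, WPTN):
    loads = WPTM + WPTN   # values of A and B cached from shared memory
    fmas = WPTM * WPTN    # multiply-adds those cached values allow
    return fmas / loads

print(fmas_per_shared_load(1, 1))  # 0.5 -> no register blocking
print(fmas_per_shared_load(4, 4))  # 2.0 -> 4x more compute per shared load
```

So growing the per-thread work in both dimensions grows compute quadratically but shared loads only linearly, which is the whole point of the optimization.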
So let us just have a look at how it works.
So we will just use our usual definitions.
Remember that we are multiplying M x K and
K x N matrices; we have tile size TSM in dimension
M, tile size TSN in dimension N and tile size
TSK in dimension K.
This is what we had set.
And then if you remember earlier, we are defining
work per thread while doing coarsening.
We will just use the same idea in terms of
work per thread in dimension M and work per
thread in dimension N.
So, if we are trying to coarsen the work in
both of these dimensions, then we will have
a reduced tile size in both the dimensions.
So, these are tiles which will be based on
the shared memory for loading data to the
registers.
So then the reduced tile size in dimension
M would be just like the tile size in dimension
M divided by the amount of work we are now
giving to a single thread in dimension M.
So it is just tile size in dimension M divided
by the work per thread that we are now trying
to define in the same dimension.
So that gives me RTSM.
Similarly, I can have RTSN.
So that is essentially nothing but TSN by
the work per thread that I have defined in
that dimension N.
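The reduced tile sizes can be sketched as below. The formulas follow the definitions just given; the concrete values of TSM, TSN, WPTM and WPTN are illustrative assumptions:

```python
# Reduced tile sizes: how many threads span each dimension of a tile.
# Parameter values are illustrative (the lecture uses tiles of size 64).
TSM, TSN = 64, 64    # tile sizes in dimensions M and N
WPTM, WPTN = 4, 4    # work per thread in M and N (assumed here)

RTSM = TSM // WPTM   # reduced tile size in M
RTSN = TSN // WPTN   # reduced tile size in N
print(RTSM, RTSN)    # 16 16 -> a 16 x 16 block of threads covers a 64 x 64 tile
```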
So, once this is done, then I can define what
is the amount of data load that each thread
has to do for A and B matrices.
In this simplistic case, we will consider
the two of them to be the same.
So LPTA is nothing but the overall tile size
for the A tile of the M x K matrix divided
by the reduced tile size in the dimension
M multiplied by the reduced tile size in the
dimension N.
So that gives you the loads per thread for
A. Similarly what happens to loads per thread
for B?
Well you consider the original tile size in
dimension K and tile size in dimension N for
the matrix B and you divide it by the reduced
tile size in the respective dimensions.
So, essentially you are dividing the original
tile size by the reduced tile size and that
is giving you the amount of loads that a thread
has to do for A and similarly, the amount
of load a thread has to do for B right.
So, since TSM and TSN are considered to be
equal in our case, the loads per thread for
the matrix A, i.e. LPTA, and the loads per
thread for matrix B, i.e. LPTB, are going
to be the same.
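Putting those definitions together, the loads per thread work out as in this sketch; again, the parameter values are illustrative assumptions:

```python
# Loads per thread for the A and B tiles.
# Illustrative parameters; TSK is the tile size in the K dimension.
TSM, TSN, TSK = 64, 64, 16
WPTM, WPTN = 4, 4

RTSM, RTSN = TSM // WPTM, TSN // WPTN   # 16, 16
threads_per_block = RTSM * RTSN         # 256 threads cooperate on one tile

LPTA = (TSK * TSM) // threads_per_block # loads per thread for the A tile
LPTB = (TSK * TSN) // threads_per_block # loads per thread for the B tile
print(LPTA, LPTB)  # 4 4 -- equal here because TSM == TSN
```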
So, with this we can then go and define the
blocks and the grid.
So, they can be just like this.
I am dividing M by TSM, N by TSN.
That would be my dim3 definition of blocks
and similarly for threads.
So, what will be my basic steps when I am
going to execute this 2D register blocking?
Of course, one thing to be sure of is that
I am increasing the work per thread further
by using this optimization.
For identifying the threads, each thread will
initialize the following local variables.
So let us say I define this tidm and tidn
just for noting down the local thread ids
with respect to the columns and the rows.
And of course, the maximum value each of them
will be this TSM divided by WPTM and similarly
TSN divided by the WPTN like we had defined
earlier.
And similarly we can compute the offsets which
will be used soon i.e. exactly from which
block, the thread shall start working.
That is figured out by multiplying this tile
size in the dimension M with the block ID
for the thread.
And similarly, the offset in the dimension
N can be found by the tile size in dimension
N multiplied by the block id in the Y dimension.
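Those identifiers and offsets can be sketched like this; the blockIdx and threadIdx values are made up for illustration:

```python
# Per-thread identifiers and per-block offsets, as described above.
TSM, TSN = 64, 64
tidm, tidn = 3, 5            # local thread ids: 0 <= tidm < RTSM, 0 <= tidn < RTSN
blockIdx_x, blockIdx_y = 2, 1

offsetM = TSM * blockIdx_x   # first row of the output tile this block owns
offsetN = TSN * blockIdx_y   # first column of the output tile this block owns
print(offsetM, offsetN)      # 128 64
```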
So, the next thing that comes is that we have
to define the memory to fit a tile for A as
well as a tile for B, right.
So, how do you really do it?
Well, we already have these TSKs and TSMs
defined.
So that gives me Asub as TSK x TSM, and Bsub
as TSN x (TSK + 2).
If you remember our earlier idea of padding
we had introduced, that is actually getting
carried over here.
So the next thing you have to do is initialize
a 2D accumulation register.
Earlier, it was a 1D array, but now since
it is 2D register blocking, you are initializing
these acc in a 2D array with all 0s for the
float values, right.
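A quick sketch of the shared memory footprint, including the +2 padding on Bsub that carries over from the bank-conflict fix, together with the 2D accumulator initialization; parameter values are illustrative:

```python
# Shared memory footprint of the two tiles (4-byte floats).
TSM, TSN, TSK = 64, 64, 16
asub_bytes = TSK * TSM * 4         # Asub[TSK][TSM]
bsub_bytes = TSN * (TSK + 2) * 4   # Bsub[TSN][TSK + 2], padded by 2
print(asub_bytes + bsub_bytes)     # 8704 bytes -- comfortably within 48 KB

# The 2D accumulation registers each thread initializes to zero:
WPTM, WPTN = 4, 4
acc = [[0.0] * WPTN for _ in range(WPTM)]
```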
But then comes the question: what is the overall
operation that you have to do per thread for
every tile?
So, this is your number of tiles: K divided
by TSK, i.e. your tile size in the K dimension.
So, that is the total number of tiles you
have in that K dimension over which you have
to hop right.
So, this is the tile loading loop, i.e. the
outer loop of the multiplications.
So, this gives you the number of tiles and
inside this you have the same old steps to
be done for this newer setting.
You have to load 1 tile of A and 1 tile of
B into the shared memory and then you have
to loop over the values of a single tile and
perform the computation just like earlier.
Only thing is that you have more work per
thread right now, and at the end of the entire
computation when all these load and compute
for all the tiles is done you just write back
the values in the result matrix C.
So here comes the load part.
So how do you really load?
Of course, one may start thinking that I have
got this 2D data to load.
But we will do it in one single loop here.
The way we are doing it is, as you can see,
for each position, I am just iterating from
0 to LPTA, right, the load per thread value.
So this is the amount of data I am supposed
to load, and since LPTA is equal to LPTB,
by iterating over this one loop I can do the
loading for both the Asub and Bsub matrices.
And from where am I supposed to load?
Well, we are representing all the threads
in the first and second dimension by one global
variable ID.
So let us understand, what is the problem
here?
So we have already defined the work per thread.
But now, I mean, ideally you want to start
thinking that this would be a 2D loading.
So why do not I have a cascade of 2 loops
for doing the loading?
Instead of that what we are doing is that
we also know how many loads one thread is
supposed to do.
So let us just iterate one loop from 0 to
that maximum number of loads that one thread
is supposed to do, and figure out a global
position for each thread.
You figure out a global position in the array
from where you are supposed to do the load.
So essentially we are trying to figure out
by computing an id.
And then, that position, I can just figure
out because I know the M, and what is the
offset inside it.
And then I can go to the corresponding row.
So by computing these values, I can just figure
out from where to load.
And then the other thing I will do is that,
I will figure out the id of the thread by
using this local computation.
You can just easily check how this ID is getting
computed.
And if you just do a modulo and a divide operation,
you can find out the row and column index
in Asub and Bsub where you are supposed to
put the value.
Mind that your Asub and Bsub have opposite
modes of access, because for B you are having
the transposed matrix; the roles of the row
and column indices are swapped.
So, in using this loop, you are just identifying
here through this access expression from which
location in the matrix A you are doing the
load from and which location in the matrix
B you are doing the load.
And by doing this computation of id with this
formula, you are just figuring out which number
to load and then you are multiplying it by
this reduced tile size to go to that corresponding
location.
And then you are just doing an offset with
the tid which you are already computed here.
So with this, you are able to go to the locations
of both A and B matrices.
So, again, we will just repeat that here,
we computed the tile index.
And then you are just multiplying the tile
index with M and N, because these are the
corresponding dimensions to look for in the
A and B matrix right.
Mind that B is transposed here.
So, these are the dimensions to look for in
the A and B matrix.
And then from A and B, you are doing the load
to Asub and Bsub just like we have discussed
earlier.
So, on the variable ID which we have computed
here, you are just doing these 2 operations
to figure out the row and column value where
to load.
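The flattened loading loop for one thread can be sketched as below. The index formulas (flat thread id, then modulo and divide) follow the scheme just described; the concrete parameter and thread-id values are illustrative assumptions:

```python
# One thread's share of the tile load, flattened into a single loop.
TSM, TSN, TSK = 64, 64, 16
RTSM, RTSN = 16, 16
LPTA = (TSK * TSM) // (RTSM * RTSN)   # 4 loads per thread

tidm, tidn = 3, 5                     # this thread's local ids (illustrative)
tid = tidn * RTSM + tidm              # flat local thread id within the block

loads = []
for la in range(LPTA):
    id_ = la * RTSN * RTSM + tid      # global position among the tile's elements
    row = id_ % TSM                   # row index inside the tile
    col = id_ // TSM                  # column index inside the tile
    # Asub[col][row] <- A[...]; Bsub[row][col] <- B[...]
    # (B is pre-transposed, so the row/column roles are swapped for Bsub.)
    loads.append((row, col))
print(loads)
```

Each successive iteration advances by one full block of threads, so together the 256 threads cover all TSK x TSM elements of the tile with no overlaps.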
Once the loading is done, you are going to
run this computational loop where you are
actually going to perform the multiplications
and for that, you do this.
This is your outer loop.
Notice first that the outer loop is going
to run for TSK iterations; that is the tile
size in K.
Because after these TSK steps, the next tile
is loaded and we again run for TSK iterations.
And that is just like how you do computation
with tiles, as we all know.
So then, the thing that you do is you cache
the values of Bsub in the registers.
Now that is an important step.
So observe what is going on.
So, you have got this value of Bsub.
Now you are going to load them in this array
Breg.
Now, why is this important?
Observe that Bsub is shared.
So it is in the shared memory, but where is
Breg?
Breg is a local array, which means this will
be located in the register.
So you are now going to cache this value of
Bsub.
So, you are just computing these column index
for the same K and getting the value from
Bsub and you are loading it into Breg right.
So, in that way, with this loop, you are loading
all the values of Bsub that you require for
multiplication in Breg.
Well, what is the next step?
So, this actually caches all the required
values for Bsub and then you get to the computation.
So, now, for the real multiplication computation,
you have this for loop where you are running
it for this wm = 0 to WPTM, right.
And here, what you are going to do is well
first you compute the row index from using
this tidm and these wm values and that gives
you the exact value of Asub, which is going
to be multiplied.
Well, so, as you can see, you are essentially
caching a single value of Asub into the register
and that is the value that is going to be
multiplied with all the values of Breg continuously
and stored in the accumulator for the corresponding
value.
So, we are not repeating this because as you
can understand this follows the similar pattern
like our earlier optimizations.
You need all the consecutive B values, but
you need this single A value.
That is why you have already cached these
B values.
So, as you can see, this is all the work for
one thread.
So, the real multiplication by the thread
is really happening here.
And since you have 2 loops, so, in that way,
you are getting a coarsened set of values
done, right.
So, if we just look at it again: inside C,
you have got the output tile, and the reduced
tile sizes define the portion computed per
thread.
I am drawing it for one thread.
You are now making that thread do a 2D computation,
so this loop cascade iterates over the number
of computations that are to be done; work
per thread in M is in one direction, of course.
You have this A and B. So, this outer loop
is going over the work per thread in direction
M and the inner loop is going over the work
per thread direction N but for a single position
of this direction M, you load 1 Areg value
right and for that you are going to use these
sequence of Breg values.
So, in every outer iteration what you are
doing is that you are caching a single value
and then in the inner iteration, you are going
to use this Areg value with this array of
Breg values.
So, what is important to notice is that every
thread has got per thread activity.
For every thread for each tile, you have got
some per thread activity.
So if you see this is per thread activity
for a thread any one time.
So for that single tile, what you are really
doing is that you are first figuring out for
this tile, what are the values I am supposed
to load.
So, this is where the threads’ loading of
values from the shared memory to the registers
are getting done.
So, that is why we have cached the values
of Bsub in registers.
So, here, inside this compute loop, there
is some amount of memory activity going on.
The memory activity to be more specific is
for loading values from the shared memory
to the registers.
And this is what we would say is the crux
of the optimization.
Like earlier, we have already seen this optimization
which is like loading from global memory to
shared.
And next, the current optimization is that
you load from shared to a 2D register.
So, that is what you do.
As you can see inside this loop for one value
of K, you are first making a sequence of loads
from Bsub.
So, you are making a sequence of loads from
Bsub to here, and then that is the activity
for this thread.
And similarly, for other threads also, they
are all making their own corresponding caching
of values of Bsub for this loop.
And for this thread, after it has got these
values cached, it goes to perform the real
computation.
So what it does is inside this for loop, first
it performs the computation of the row index.
Then it uses the row index to compute what
value is to be loaded into Areg for the corresponding
Asub shared memory.
For that, it has got this index i.e. which
row and K is getting computed.
And then here, it is keeping it constant,
and multiplying it continuously with these
Breg values.
So it is multiplying this Areg continuously
with Breg values and it is iterated here over
this.
Well, once this is done, in the next iteration
of the outer loop it will again load another
corresponding value from Asub to Areg.
And then again in the inner loop, it will
multiply this Areg value with a sequence of
values from Breg and that is how it will work.
So, with this we have the computation done
in 2D.
And once this is done, the next thing is storing
the result back to the global memory.
So for that, first you have to figure out
the location where it has to be written.
And again, as you can see, since this is a
2D activity, this thread is going to write
back a 2D array of values.
Again, you have a cascade of 2 loops.
And the thread figures out the global position
where it has to do the store.
So that is figured out by first computing
a global row from the outer loop.
And then in each iteration of the inner loop,
it figures out a global column value.
And this combination of global column and
global row is used to figure out a position
in the C matrix where this value from acc
has to be written back.
So this is how the final values get stored
in C.
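The write-back cascade just described can be sketched for one thread as below; the offsets and thread ids are the same illustrative values assumed earlier:

```python
# Final write-back: one thread scatters its WPTM x WPTN accumulator into C.
TSM, TSN = 64, 64
RTSM = RTSN = 16
WPTM = WPTN = 4
offsetM, offsetN = 128, 64      # from the block ids (illustrative)
tidm, tidn = 3, 5               # this thread's local ids (illustrative)

positions = []
for wm in range(WPTM):
    globalRow = offsetM + tidm + wm * RTSM
    for wn in range(WPTN):
        globalCol = offsetN + tidn + wn * RTSN
        positions.append((globalRow, globalCol))
        # C[globalRow][globalCol] = acc[wm][wn]
print(positions[0], positions[-1])
```

Note the RTSM/RTSN strides: consecutive threads write to consecutive rows and columns, which keeps the global stores coalesced across the thread block.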
So this is our idea of how these different
possible optimizations can be performed for
doing the GEMM computation.
So just to recall, all we do here is that
we are using our earlier optimizations together,
but doing it in a 2 dimensional way, by coarsening
the threads in both dimensions.
Well here we have not used the wider load
idea.
That is also something you need to understand.
But the important thing is for you to figure
this out , i.e. how the computes work as I
am giving a 2D amount of computation to be
done for each thread.
That is why now I have a cascade of loops
here for doing all those computations and
also the loading of the data.
And as you recall, there are 2 levels of
loading.
Here you have a load going on from the global
to the shared memory.
And then in your earlier compute loop, you
now have an amount of load first done from
the shared to the registers.
And these 2 steps we are now kind of differentiating
here.
And then I am using the cached values to multiply
i.e.
I am caching a single value of Areg and multiplying
it with a sequence of values of Breg, and
so on so forth through this loop cascade.
And then again, I am storing it back using
another loop cascade.
So right now I have more work per thread.
And these are good optimizations in the sense
that in modern GPUs you have sufficiently
big shared memories and if you can just choose
a proper tile size and a suitable thread coarsening
factor in the 2 dimensions then this can really
help.
So the summary here would be that GEMM is
the most computationally heavy kernel, used
in the fully connected layers and also the
convolution layers of neural networks.
The optimized implementations that we have
discussed so far focus on the training phase,
of course; if we are speaking about the other
phase of neural network work, i.e. inferencing,
that refers only to a forward pass over a
trained neural network.
And so in general, these optimizations we
have discussed will likely be more important
for an HPC kind of process where you are primarily
going to do large neural network training
in significantly powerful computers, whereas
inferencing is the more popular workload for
embedded GPUs.
Because in an embedded GPU, you can have an
inferencing engine for doing a lot of stuff,
like recognizing somebody's image from a camera,
or doing some object detection.
Or maybe live streaming of video, doing a
face detection and all that are inferencing
kind of workloads.
So they also require different kinds of optimizations,
which you can figure out.
So, with this maybe we will like to close
this current lecture.
Thank you.
