This course page was maintained until March 2022, when I left Durham University. The materials here are therefore not necessarily up to date.
Getting compilers to do the right thing #
We’re going to look at how to convince the compiler to vectorise a loop the way we want it to.
As our starting point we’ll use a C version of the GEMM micro-kernel used in the BLIS framework.
Rather than doing this on Hamilton, since we're not actually going to run the code, we will use the Compiler Explorer, an online frontend for trying out lots of different compilers. I've set up a pre-filled version.
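To make the discussion concrete, here is a minimal sketch of the kind of micro-kernel we're working with. This is a hypothetical reconstruction: the pre-filled code may differ in names and signature, but the blocking parameters `MR` and `NR` and the loop structure match what's described below.

```c
/* Sketch of a BLIS-style GEMM micro-kernel:
 * C (MR x NR) += A (MR x kc) * B (kc x NR).
 * Names and layout are illustrative, not the exact exercise code. */
#define MR 1 /* blocking factor for the i (row) loop    */
#define NR 1 /* blocking factor for the j (column) loop */

static void micro_kernel(int kc, const double *restrict A,
                         const double *restrict B, double *restrict C)
{
    for (int l = 0; l < kc; l++)         /* loop 1 */
        for (int j = 0; j < NR; j++)     /* loop 2 */
            for (int i = 0; i < MR; i++) /* loop 3: i is stride-1 in A and C */
                C[j * MR + i] += A[l * MR + i] * B[l * NR + j];
}
```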
Does vectorisation occur? #
Use `-xBROADWELL -O2` as the optimisation flags. What does the vectorisation report say? Which loop was vectorised?
The default blocking factors (`MR` and `NR`) are both 1. Given that AVX registers can hold four doubles, it makes sense to use larger blocks.
Exercise
Try with some larger blocks and `-O3` instead of `-O2`.
Question
Which loop, if any, was vectorised this time?
Vectorising the correct loops #
Since the `i` loop is stride-1, it really makes sense to vectorise this innermost loop. Try to convince the compiler to do so.
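One way to do this with the Intel compiler is to annotate the `i` loop: `#pragma vector always` asks the compiler to vectorise the following loop even when its cost model thinks it unprofitable, and `#pragma omp simd` (compiled with `-qopenmp-simd`) is a portable alternative. A sketch, reusing the hypothetical kernel from above:

```c
for (int l = 0; l < kc; l++)
    for (int j = 0; j < NR; j++)
        /* Ask for the stride-1 innermost loop to be vectorised,
         * overriding the compiler's profitability heuristics. */
        #pragma vector always
        for (int i = 0; i < MR; i++)
            C[j * MR + i] += A[l * MR + i] * B[l * NR + j];
```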
Question
Compare the estimated speedup the compiler reports for the original loop vectorisation choice and the new one.
What do you observe?
Providing more detailed information #
Part of the problem is that the compiler assumes that the array
accesses are not aligned to cache line boundaries (or vector
registers). Its cost model knows that these are more expensive than if
the data are aligned. However, this drives some bad decisions. We can
help the compiler by letting it know the byte alignment of the arrays.
Use `__assume_aligned(PTR, 64)` to promise to the compiler that the given pointer is 64-byte aligned.
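For example (assuming, as in the sketches above, that the three pointer arguments are called `A`, `B`, and `C`):

```c
/* Promise the compiler that each pointer is 64-byte aligned.
 * This is only safe if the allocations really do guarantee it. */
__assume_aligned(A, 64);
__assume_aligned(B, 64);
__assume_aligned(C, 64);
```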
Question
Try adding alignment assumptions before the start of the loop nest.
What happens to the estimated speedup?
Trying to maximise throughput #
One thing you may have noticed is that the compiler performs a loop interchange. We have a triply nested loop:

```c
for (l = 0; l < ...; l++)     /* loop 1 */
  for (j = 0; j < ...; j++)   /* loop 2 */
    for (i = 0; i < ...; i++) /* loop 3 */
```
But when I run things, I see that the compiler reports that it reordered the loops from (1 2 3) to (2 1 3). It then attempts to vectorise the innermost loop and unroll the new middle loop, whose trip count is no longer fixed at compile time.
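Schematically, the interchanged nest looks something like this (reconstructed from the compiler report, so treat it as a sketch):

```c
for (j = 0; j < ...; j++)         /* loop 2: now outermost */
    for (l = 0; l < ...; l++)     /* loop 1: now middle, unrolled, but its
                                     trip count is not a compile-time constant */
        for (i = 0; i < ...; i++) /* loop 3: still innermost, vectorised */
```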
Again, this is an example where the compiler's cost model is incorrect. We can help it out again by telling it to unroll the `j` loop with `#pragma unroll`.
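A sketch of what that might look like, again using the hypothetical kernel from above; `#pragma unroll` without a count lets the Intel compiler unroll the `j` loop fully, since `NR` is a compile-time constant:

```c
for (int l = 0; l < kc; l++)
    /* Fully unroll the fixed-length j loop... */
    #pragma unroll
    for (int j = 0; j < NR; j++)
        /* ...and vectorise the stride-1 i loop. */
        #pragma vector always
        for (int i = 0; i < MR; i++)
            C[j * MR + i] += A[l * MR + i] * B[l * NR + j];
```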
We now need to pick good values for the `NR` and `MR` parameters. Broadwell chips have 16 AVX registers, which is enough to store 64 double precision numbers. Given this information, what do you think the best blocking factors might be? Note that since `B[j]` is not stride-1 in the innermost loop, to vectorise the FMA we need to replicate the value of `B[j]` over all lanes of a vector register. That is, inside the `i` loop, we need as many registers for `B` as the unroll factor of the `j` loop.
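As a worked example of the register counting (one plausible choice, not necessarily the best): with `MR = 8` and `NR = 4`, each column of the `C` block needs `MR/4 = 2` AVX registers, so the accumulators occupy 2 × 4 = 8 registers; a column of `A` needs 2 more; and the 4 broadcast values of `B` need one register each. That's 8 + 2 + 4 = 14 of the 16 available registers, leaving a little slack for the compiler.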
Question
What’s the maximum estimated speedup you can achieve?
Putting it together #
The subdirectory `code/exercise09/blis-gemm/` contains a complete implementation of this scheme (I don't have optimal parameters though). You can edit the `parameters.h` file to set the blocking parameters `MR` and `NR`; you'll also need to edit `micro-kernel.c` to annotate the loops with the pragmas you found to be useful on the Compiler Explorer.
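For example, `parameters.h` might end up looking something like this (the macro names follow the text, but the values are purely illustrative):

```c
/* Illustrative values only: tune MR and NR with the register-counting
 * argument above, and MC, KC, NC against the cache hierarchy. */
#define MR 8
#define NR 4
#define MC 256
#define KC 256
#define NC 4096
```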
Exercise
Set all five blocking parameters in `parameters.h` to 1, and compile the code. Keep track of the values for `MC`, `KC`, and `NC`, since we'll use them later. Run for a range of matrix sizes between 100 and 2000. What performance do you observe?
Exercise
Now restore the values for `MC`, `KC`, and `NC`, and use your good parameters for `MR` and `NR`. Recompile and rerun the benchmarking. What performance do you observe now?
Exercise
In addition to having good parameters, now add the pragmas to the micro-kernel in `micro-kernel.c` and rerun the same experiments. What performance do you observe now?

Last time I tried this, I found that I needed to stop `micro_kernel` being inlined by the Intel compiler to get really good performance, so change the signature to `__attribute__((noinline)) static void micro_kernel(...)`.

Rerun again: do you observe any further differences?
How close to peak performance does the code get?