This course page was updated until March 2022, when I left Durham University. The materials herein are therefore not necessarily still up to date.
A roofline analysis of matrix-vector multiplication #
The goal of this exercise is to perform a roofline analysis of matrix-vector multiplication. We will look at the effect that compiler flags have on the performance of generated code. Next, we will see if the performance that we obtain is independent of the problem size. Finally, we will investigate loop blocking.
Background #
I provide an implementation, in C, of double-precision matrix-vector multiplication in code/exercise04/dmvm.c, which computes
$$ \vec{y} = A \vec{x} $$
for an \(n_\text{row} \times n_\text{col}\) rectangular matrix. It takes two command-line arguments, the number of
rows and the number of columns. When run, it sets up the matrix and vectors, performs the computation, and prints out information about the performance. For example, after compiling and running with
./dmvm 1000 2000
you might see output similar to
1019 1000 2000 6491.23
The four columns are:
- The number of iterations the test was run for;
- The number of rows the matrix had;
- The number of columns the matrix had;
- The performance in MFLOPs/s.
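The last column is presumably computed by counting two floating point operations (one multiply, one add) per matrix entry per iteration; that is an assumption about the benchmark's bookkeeping which you can check against the source, but under it
$$ \text{performance} \approx \frac{2 \, n_\text{iter} \, n_\text{row} \, n_\text{col}}{10^6 \, t} \;\text{MFLOPs/s}, $$
where \(t\) is the measured run time in seconds.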
Instead of downloading the code files individually, you should clone the repository on Hamilton and work in the appropriate code/exerciseXX subdirectory.
Setup #
As usual, you should do this on Hamilton.
Downloading and compiling the code #
We’ll use the Intel C compiler for this exercise, which requires us to load some extra modules. To get the most recent version, run module load intel/2019.5 and module load gcc/8.2.0. In addition, we will want to make performance measurements with likwid, so also run module load likwid/5.0.1.
You can use wget to download the code when logged in. Note: wget does not work on the compute nodes, so you should download any material on the login (frontend) node.
Compilation without any optimisation #
First, we will compile without any optimisations.
Normally you should not do this, but we want to see what the effect is!
icc -std=c11 -O0 -o dmvm dmvm.c
You should now be able to run the dmvm binary with ./dmvm 4000 4000 for a \(4000 \times 4000\) matrix.
Obtain the machine and code characteristics #
To perform a roofline analysis, we need the machine and code characteristics.
Exercise
Measure the main memory streaming bandwidth using the triad benchmark of likwid-bench.
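For example, an invocation along the following lines measures triad bandwidth with one thread on socket 0 and a 1GB working set; the thread domain (S0), working set size, and thread count are illustrative choices that you may need to adapt to the node you are running on:
likwid-bench -t triad -w S0:1GB:1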
Exercise
Compute the peak floating point throughput of a compute node. A single core on the Hamilton nodes runs at 2.9 GHz and can issue two double precision FMAs (fused multiply-add instructions) per cycle.
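As a sketch of the arithmetic (the vector width used here is an assumption you should confirm for the Hamilton processors): if each of those FMAs is a 4-wide AVX2 vector instruction, then a single core can sustain at most
$$ 2.9 \times 10^9 \;\text{cycles/s} \times 2 \;\text{FMA/cycle} \times 4 \;\text{doubles/FMA} \times 2 \;\text{flops/double} = 46.4 \;\text{GFLOPs/s}, $$
and the peak for a node is that figure multiplied by the number of cores in the node.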
Exercise
Read the code, in particular the dmvm function, and determine the best-case arithmetic intensity by counting data accesses (using the perfect cache model) and floating point operations.
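If it helps to have a mental model while counting, the kernel has the structure of a doubly nested loop like the minimal sketch below. This is only a sketch: the loop order, data layout, and names in dmvm.c may well differ, so base your counts on the real code.

```c
#include <stddef.h>

/* Minimal sketch of a dense matrix-vector product y += A x, assuming a
 * column-major layout a[c * nrow + r]; dmvm.c may differ in layout and
 * loop order, but the operation counts are the same. */
static void dmvm_sketch(int nrow, int ncol, const double *a,
                        const double *x, double *restrict y)
{
  for (int c = 0; c < ncol; c++) {
    for (int r = 0; r < nrow; r++) {
      /* One multiply and one add: 2 flops per matrix entry. */
      y[r] += a[(size_t)c * nrow + r] * x[c];
    }
  }
}
```

In the perfect cache model each element of \(A\), \(\vec{x}\) and \(\vec{y}\) only needs to move between main memory and the core once, which is what you should count here.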
Produce a roofline plot of -O0 compiled code #
Exercise
Using a fixed number of columns \(n_\text{col} = 10000\), measure the performance of your dmvm binary over a range of row sizes from 1000 to 100000. Plot the resulting data using a roofline plot.
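One way to collect the data (a sketch; pick whichever row sizes you like, remembering that the row count is the first argument) is a simple loop:
for n in 1000 2000 5000 10000 20000 50000 100000; do ./dmvm $n 10000; done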
Question
What do you think your next step in optimising the code should be?
Better compiler options #
We’ll stop crippling the compiler. Try these sets of compiler options:
icc -std=c11 -O1 -o dmvm dmvm.c
icc -std=c11 -O2 -o dmvm dmvm.c
icc -std=c11 -O3 -o dmvm dmvm.c
icc -std=c11 -xCORE-AVX2 -unroll16 -mfma -O3 -o dmvm dmvm.c
Exercise
Add the results you get from these runs to your roofline plots. What do you see?
Question
Is performance for any of these options independent of the problem size? What do you think the bottleneck for this algorithm is?
Loop blocking for out-of-cache performance #
You should have observed that when the number of rows gets too large, the performance of the code drops by almost a factor of two. When the number of rows is very large, some of the vector data (which is accessed more than once) no longer fits in cache.
Exercise
Use the pessimal cache model to obtain a worst case arithmetic intensity and add these new data points to your roofline plot.
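As a reminder of how the two models differ for this kernel (a sketch; the real loop structure in dmvm.c determines exactly which accesses repeat): the perfect cache model moves every element of \(A\), \(\vec{x}\) and \(\vec{y}\) between memory and the core exactly once, so for large matrices the traffic is dominated by the \(8 n_\text{row} n_\text{col}\) bytes of \(A\) and
$$ I_\text{best} \approx \frac{2 \, n_\text{row} \, n_\text{col}}{8 \, n_\text{row} \, n_\text{col}} = 0.25 \;\text{flops/byte}, $$
whereas the pessimal model charges every load and store in the innermost loop to main memory, so the vector data that is re-traversed contributes traffic on every pass rather than once, and the intensity is correspondingly lower.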
A mechanism to solve this problem is to traverse the data in an order
that is aware of the cache. For loop-based code, this is called loop
blocking or tiling. I provide an implementation of a simple scheme in code/exercise04/dmvm-blocked.c. Compile it using the set of compiler options that worked best for the original code, and use -o dmvm-blocked to produce a new binary.
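For orientation, blocking the row loop looks roughly like the sketch below. The block size, loop order, and data layout here are illustrative assumptions; dmvm-blocked.c may make different choices.

```c
#include <stddef.h>

/* Sketch of a row-blocked y += A x: work on a block of rows small enough
 * that the corresponding piece of y stays in cache while all the columns
 * of A are streamed through.  Block size, layout (column-major
 * a[c * nrow + r]) and loop order are assumptions for illustration; see
 * dmvm-blocked.c for the real scheme. */
static void dmvm_blocked_sketch(int nrow, int ncol, const double *a,
                                const double *x, double *restrict y)
{
  const int bs = 2048;                    /* rows per block: an assumed value */
  for (int rb = 0; rb < nrow; rb += bs) {
    const int rend = rb + bs < nrow ? rb + bs : nrow;
    for (int c = 0; c < ncol; c++) {      /* stream through the columns */
      for (int r = rb; r < rend; r++) {   /* y[rb:rend) stays cache-resident */
        y[r] += a[(size_t)c * nrow + r] * x[c];
      }
    }
  }
}
```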
Exercise
As before, using a fixed number of columns \(n_\text{col} = 10000\), measure the performance of your dmvm-blocked binary over a range of row sizes from 1000 to 100000. You should observe that the performance is now largely insensitive to the number of rows. Plot the resulting data using a roofline plot.
Question
What do you think your next step (if any) in optimising the code should be?