This course page was updated until March 2022 when I left Durham University. The materials herein are therefore not necessarily still in date.

Verifying a model with measurements #

The goal of this exercise is to verify our model for the number of loads and stores in a stream benchmark using performance counters, accessed via likwid-perfctr.

Background #

I provide an implementation (in code/exercise05/stream.c) written in C of the STREAM TRIAD benchmark. It provides scalar, SSE, and AVX implementations of the loop

double *a, *b, *c;
...
for (i = 0; i &lt; N; i++) {
  c[i] = c[i] + a[i] * b[i];
}

We will measure the number of loads and stores for this loop using likwid-perfctr. So our first task is to compile the code appropriately.

Compiling with likwid annotations enabled #

The code is annotated with likwid-specific markers around the relevant loops. This way we can ensure that we’re only measuring counters for the bit of the code we’re interested in. We therefore need to enable these when compiling. As before we load the likwid module with module load likwid/5.0.1.

We can safely compile this code with GCC, load the GCC module with module load gcc/9.3.0 and run:

gcc -std=c99 -mfma -O1 -DLIKWID_PERFMON -fno-inline -march=native -o stream stream.c -llikwid

The flag -DLIKWID_PERFMON adds a new symbol in the preprocessor which will turn on the likwid markers. We then also need to link against the likwid runtime library with -llikwid. To ensure that this library is available when you run the code, you should load the likwid module on the compute node (or run module load likwid/5.0.1 in your batch submission script).

For this exercise, likwid-perfctr does not give us an appropriate predefined group. Instead, we must use a low-level counter directly. For loads the relevant counter is MEM_UOPS_RETIRED_LOADS_ALL, for stores it is MEM_UOPS_RETIRED_STORES_ALL. We must specify a group string which consists of the name of the counter and the register to save it in. For memory operations we can use the registers PMC0 and PMC1 (possibly others, but these will suffice).

So, a complete command line to measure the number of loads is

likwid-perfctr -m -g "MEM_UOPS_RETIRED_LOADS_ALL:PMC0" -C 0 ./stream 1000000 sca

This says to enable the marker API (with -m), to count the MEM_UOPS_RETIRED_LOADS_ALL event in PMC0, and to pin the executable to core zero (with -C 0). This final part is necessary so that the operating system does not move the executable half way through, breaking our measurements. You should see output similar to the following

--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz CPU type: Intel
Xeon Broadwell EN/EP/EX processor CPU clock: 2.20 GHz
--------------------------------------------------------------------------------
Sleeping longer as likwid_sleep() called without prior initialization
sca loop, sum 3.33328e+17
--------------------------------------------------------------------------------
Region Scalar, Group 1: Custom
+-------------------+----------+
|    Region Info    |  Core 0  |
+-------------------+----------+
| RDTSC Runtime [s] | 0.001473 |
|     call count    |        1 |
+-------------------+----------+

+----------------------------+---------+--------------+
|            Event           | Counter |    Core 0    |
+----------------------------+---------+--------------+
|     Runtime (RDTSC) [s]    |   TSC   | 1.472779e-03 |
| MEM_UOPS_RETIRED_LOADS_ALL |   PMC0  |      3001190 |
|      INSTR_RETIRED_ANY     |  FIXC0  |      4629691 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  |      3247484 |
|    CPU_CLK_UNHALTED_REF    |  FIXC2  |      3247574 |
+----------------------------+---------+--------------+

This measurement at least, aligns with what we expected, since we see about 300000 loads.

Question
How does the number of loads change if you use the SSE, AVX, or FMA versions of the code?
How many stores do you measure?
Can you find a way of measuring the stores and loads in one go?

Question
Try compiling the code with:
gcc -std=c99 -mfma -O1 -DLIKWID_PERFMON -fno-inline -o stream stream.c -llikwid
Do you still measure the same instruction counts for all cases?
To understand what is going on, we can inspect the assembly code for these two cases by getting the compiler to spit that out instead:
gcc -std=c99 -mfma -O1 -DLIKWID_PERFMON -fno-inline -S -o stream-no-march.s stream.c -llikwid gcc -std=c99 -mfma -O1 -DLIKWID_PERFMON -fno-inline -march=native -S -o stream-march-native.s stream.c -llikwid
Compare the loop bodies for the functions, do you observe any differences in the assembly that might explain things?