OpenMP: synchronisation

Avoiding data-races in updates to shared memory #

In this exercise, we’ll use the synchronisation constructs we encountered when looking at OpenMP collectives to implement different approaches to combining the partial sums in a reduction.

We’ll then benchmark the performance of the different approaches to see whether there are any differences.

Template code and benchmarking #

As usual, the starter code lives in the repository, in the code/openmp-snippets subdirectory. In reduction-template.c is some code that times how long it takes to run the reduction.

You can select the length of the vectors whose dot product is computed by passing a size on the command line. For example, after compiling with

$ icc -qopenmp -o reduction-template reduction-template.c

you can then run with, for example,

$ OMP_NUM_THREADS=2 ./reduction-template 1000000

The implementation of the parallel reduction is left up to you.

You should implement a correct reduction using the four different approaches listed in the code:

  1. “By hand”, using the same kind of approach as in openmp-snippets/reduction-hand.c;
  2. Using an atomic directive to protect the shared updates;
  3. Using a critical section to protect the shared updates;
  4. Using the reduction clause on a parallel loop (a minimal sketch of this last approach is shown just below the list).
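A minimal sketch of that last approach (assuming, as in the snippets further down, that the vectors are a and b, the result accumulates in dotab, and n is the vector length):

double dotab = 0.0;
#pragma omp parallel for reduction(+:dotab)
for (int i = 0; i < n; i++) {
  dotab += a[i]*b[i];          /* OpenMP combines the per-thread partial sums for us */
}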
Solution

I implemented these approaches in code/openmp-snippets/reduction-different-approaches.c. The crucial thing to note with the critical section and atomics is to compute partial dot products in private variables and then protect the single shared update.

That is, do this:

#pragma omp parallel
{
  double mydot = 0.0;            /* private partial sum per thread */
#pragma omp for
  for (int i = 0; i < n; i++) {  /* n is the vector length */
    mydot += a[i]*b[i];
  }
#pragma omp atomic
  dotab += mydot;                /* one protected update per thread */
}

rather than this:

#pragma omp parallel
{
#pragma omp for
  for (int i = 0; i < n; i++) {
#pragma omp atomic
    dotab += a[i]*b[i];          /* an atomic update on every iteration */
  }
}

The latter does far more locking and synchronisation than necessary.
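Approach 1 avoids synchronising the shared update altogether. I don’t reproduce reduction-hand.c here, but a common “by hand” pattern (a sketch only; it may differ in detail from reduction-hand.c, and it needs <omp.h> and <stdlib.h>) gives each thread its own slot in a shared array and combines the partial sums serially afterwards:

double *partial = calloc(omp_get_max_threads(), sizeof(*partial));
#pragma omp parallel
{
  double mydot = 0.0;
#pragma omp for
  for (int i = 0; i < n; i++) {
    mydot += a[i]*b[i];
  }
  partial[omp_get_thread_num()] = mydot;   /* each thread writes only its own slot */
}
for (int t = 0; t < omp_get_max_threads(); t++) {
  dotab += partial[t];                     /* serial combination, no race */
}
free(partial);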

Exercise

For each of the approaches, benchmark the time it takes to run the reduction for a vector with 1 billion entries across a range of thread counts, from one up to 48.
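One way to collect the timings (a sketch only; adjust the thread counts and vector size to taste) is a simple shell loop:

$ for n in 1 2 4 8 12 16 24 32 48; do OMP_NUM_THREADS=$n ./reduction-template 1000000000; done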

Produce plots of the parallel scaling and parallel efficiency of the different approaches.
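For reference, the usual strong-scaling quantities, writing $T(p)$ for the measured time on $p$ threads, are

$$S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)},$$

with perfect scaling corresponding to $E(p) = 1$.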

Solution

I did this with a vector of 500 million entries and only ran up to 24 threads, because I didn’t want to wait that long; the results will be broadly the same.

This time I plot time to solution against number of threads on a semilogy axis.

Time to solution (lower is better) for a vector dot product as a function of the number of threads.

It looks like all the solutions are broadly identical in performance.

We can also plot the parallel efficiency, which tells a similar story.

Parallel efficiency (for strong scaling) for a vector dot product.

It appears for these benchmarks that the “by hand” approach is marginally better, but we would have to do more detailed benchmarking to be confident.

Question

Which approach works best?

Which approach works worst?

Do you observe perfect scalability? That is, is the speedup linear in the number of threads?

Solution
As we see in the plots above, it looks like all the approaches are basically equally good for this test. The speedup is far from perfect, because the memory bandwidth that a Hamilton compute node delivers does not scale linearly with the number of cores: once we’ve got to around eight cores, we’re already using all of the available bandwidth, and adding more cores doesn’t get us the answer any faster.
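To make this concrete (the bandwidth figure here is an assumption for illustration, not a measured number for Hamilton): the dot product with 500 million entries streams 2 × 500 million × 8 bytes = 8 GB through memory while doing only two floating point operations per element, so it is firmly memory-bandwidth bound; at, say, 100 GB/s of sustained node bandwidth the reduction cannot take much less than 0.08 s, however many cores we use.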

More details: thread placement #

The Hamilton compute nodes are dual-socket. That is, they have two chips on a single motherboard, each with its own memory attached. Although OpenMP treats this logically as a single piece of shared memory, the performance of the code depends on where the memory being accessed is located relative to where the threads are running.

We will now briefly investigate this.

OpenMP exposes some extra environment variables that control where threads are physically placed on the hardware. We’ll look at how these affect the performance of our reduction.

Use the implementation with the reduction clause; we hope it is the most efficient!

The relevant environment variables for controlling placement are OMP_PROC_BIND and OMP_PLACES. We need to set OMP_PLACES=cores:

$ export OMP_PLACES=cores
Hint
Don’t forget to do this in your submission script.
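For example, the relevant lines of a submission script might look like this (a sketch only; I leave out the scheduler directives, and the OMP_PROC_BIND setting is just an example of where it goes):

#!/bin/bash
# ... scheduler directives as usual ...
export OMP_PLACES=cores       # pin each thread to a physical core
export OMP_PROC_BIND=close    # or spread; see the exercise below
export OMP_NUM_THREADS=48
./reduction-template 1000000000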

Exercise

We’ll now look at the difference in scaling performance when using different values for OMP_PROC_BIND.

As before, run with a vector with 1 billion entries, from one to 48 threads, and produce scaling plots.

First, use

OMP_PROC_BIND=close

This places threads on cores that are physically close to one another, first filling up the first socket, and then the second.

Compare this to results obtained with

OMP_PROC_BIND=spread

This spreads threads out across the cores: thread 0 goes to the first socket, thread 1 to the second, and so on. I think this is the default setting when using the Intel OpenMP runtime.
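Concretely, with OMP_PLACES=cores exported, a pair of runs to compare might look like this (the thread count here is just for illustration):

$ OMP_PROC_BIND=close OMP_NUM_THREADS=12 ./reduction-template 1000000000
$ OMP_PROC_BIND=spread OMP_NUM_THREADS=12 ./reduction-template 1000000000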

Which approach works better? Is there a difference at all?

Solution

I just do this for the reduction clause case, since we determined that everything is basically the same for the other versions.

I do this (again with a vector of 500 million entries; I suspect the results will be broadly the same for 1 billion) and find that when all the cores are filled, the speedup looks pretty much identical for both settings.

However, at intermediate thread counts, OMP_PROC_BIND=spread produces better performance.

This is because it places threads on both sockets on the compute node, maximising the amount of memory bandwidth that the program can obtain.

The reduction performs better when using an intermediate number of threads if we specify a spread distribution.