OpenMP collectives #
So far we’ve seen how we can create thread teams using #pragma omp parallel and distribute work in loops between members of the team by using #pragma omp for.
Now we’ll look at what we need to do if we need to communicate between threads.
Reductions #
Remember that the OpenMP programming model allows communication between threads by using shared memory. If some piece of memory is shared in a parallel region then every thread in the team can both read and write to it.
Recall also that there is no synchronisation across accesses to shared memory. Let us consider what this means for computing a vector dot product
$$ a \cdot b = \sum_{i=1}^{N} a_i b_i $$
A serial loop to compute this might look like the following
dot = 0;
for (int i = 0; i < N; i++) {
  dot += a[i]*b[i];
}
A naive parallelisation would just annotate the loop with #pragma omp parallel for.
dot = 0;
#pragma omp parallel for default(none) shared(dot, a, b) schedule(static)
for (int i = 0; i < N; i++) {
  dot += a[i]*b[i];
}
However, this has a problem. The shared variable dot is updated by multiple threads, and so we are not guaranteed that all increments will be seen, since there is a write race in the increment of dot.
I discuss some tools for detecting data races below.
We can test this out with the following code
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1024;
  int *a = malloc(N * sizeof(*a));
  int *b = malloc(N * sizeof(*b));
  /* Initialise with some values */
  for (int i = 0; i < N; i++) {
    a[i] = i+1;
    b[i] = i*2;
  }
  /* Check */
  int dotabserial = 0;
  for (int i = 0; i < N; i++) {
    dotabserial += a[i]*b[i];
  }
  printf(" Serial a.b = %d\n", dotabserial);

  int dotabparallel = 0;
  #pragma omp parallel default(none) shared(a, b, N, dotabparallel)
  {
    #pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
      dotabparallel += a[i]*b[i];
    }
  }
  printf("Parallel a.b = %d\n", dotabparallel);
  free(a);
  free(b);
  return 0;
}
Exercise
This example code computes a dot product in serial and then in parallel. But the parallel version has race conditions.
Try running with different numbers of threads. Do you always get the correct answer in parallel? Do you always get the same wrong answer?
Solution
I, at least, don’t always get the same wrong answer (but I generally get the wrong answer).
The solution to this problem is to create partial sums on each thread, and then accumulate them in a thread-safe way.
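The full code is in the reduction-hand.c example from the course code. It is not reproduced verbatim here, so the sketch below just shows the idea, using a shared array dotlocal of per-thread partial sums (the exact listing may differ slightly in names and line numbers).

/* Sketch of the reduction-hand.c idea: per-thread partial sums,
   combined by hand after everyone has finished. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1024;
  int *a = malloc(N * sizeof(*a));
  int *b = malloc(N * sizeof(*b));
  for (int i = 0; i < N; i++) {
    a[i] = i+1;
    b[i] = i*2;
  }
  int dotab = 0;
  int *dotlocal = NULL; /* Shared array of per-thread partial sums */
  #pragma omp parallel default(none) shared(a, b, N, dotab, dotlocal)
  {
    int tid = omp_get_thread_num();
    int nthread = omp_get_num_threads();
    /* One thread allocates the partial sums. */
    #pragma omp single
    dotlocal = calloc(nthread, sizeof(*dotlocal));
    /* Implicit barrier here: nobody may touch dotlocal before it exists. */

    /* Each thread writes only its own slot, so there is no write race. */
    #pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
      dotlocal[tid] += a[i]*b[i];
    }
    /* Implicit barrier here: all partial sums must be complete. */

    /* One thread combines the partial sums. */
    #pragma omp single
    {
      for (int t = 0; t < nthread; t++) {
        dotab += dotlocal[t];
      }
      free(dotlocal);
    }
  }
  printf("a.b = %d\n", dotab);
  free(a);
  free(b);
  return 0;
}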
As the comments indicate, all the barriers are quite delicate.
Exercise
Run this code and check that it continues to give you the correct answer independent of the number of threads you use.
Now explore what happens if you accidentally got some of the barriers wrong.
1. What happens if you add nowait to the single directive that allocates dotlocal?
2. What happens if you add nowait to the for directive that computes the partial sums? (Remove the nowait from the single directive again!)

Solution
For 1., you probably get a Segmentation fault, because we need to wait for the array to be allocated: we need to synchronise after the allocation.

For 2., you probably get the wrong answer. The reason is that if one thread finishes in the parallel loop before the others, it will immediately go ahead and start adding up the contributions, so it will pick up whatever happens to be the current value in the other entries of dotlocal (which will likely be incomplete).
Directives to the rescue #
We see that writing a collective reduction by hand is possible, but a
bit tedious. OpenMP provides facilities to handle all of the gory
details by adding an extra
reduction
clause to the for
directive
#pragma omp for schedule(static) reduction(+:dot)
for (int i = 0; i < N; i++) {
  dot += a[i]*b[i];
}
This tells OpenMP that dot is a reduction variable, and that the combining operation is +. It now becomes the compiler’s job to generate appropriate code for privatising the partial reduction contributions from different threads and then combining them.
reduction
clauses have some restrictions on the type of data they
can combine. We need an associative binary operator (so that the
combination can happen in any order) and an identity element for the
operation (so that the compiler can generate the right initial value
for the privatised reduction variables).
Reductions are defined for builtin datatypes (such as int, double) and for the combining operations in the table below.

Operation      | Operator | Initialisation
---------------|----------|------------------------------------
Addition       | +        | 0
Multiplication | *        | 1
Subtraction¹   | -        | 0
Minimum        | min      | Most positive number of given type
Maximum        | max      | Most negative number of given type
Logical and    | &&       | 1
Logical or     | ||       | 0
Bitwise and    | &        | ~0 (all bits on)
Bitwise or     | |        | 0 (all bits off)
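For example, a sketch (mine, not from the course code) that uses a max reduction to find the largest entry of a:

int amax = a[0];
/* Each private copy of amax starts from the most negative int, and the
   per-thread results are combined with max (and the original value). */
#pragma omp parallel for default(none) shared(a, N) reduction(max:amax)
for (int i = 0; i < N; i++) {
  if (a[i] > amax) {
    amax = a[i];
  }
}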
With this, we can rewrite our dot product example
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1024;
  int *a = malloc(N * sizeof(*a));
  int *b = malloc(N * sizeof(*b));
  /* Initialise with some values */
  for (int i = 0; i < N; i++) {
    a[i] = (i+1) * (2 - (i % 5));
    b[i] = i*2;
  }
  /* Check */
  int dotabserial = 0;
  for (int i = 0; i < N; i++) {
    dotabserial += a[i]*b[i];
  }
  printf(" Serial a.b = %d\n", dotabserial);

  int dotabparallel = 0;
  #pragma omp parallel for schedule(static) default(none) \
    shared(a, b, N) reduction(+:dotabparallel)
  for (int i = 0; i < N; i++) {
    dotabparallel += a[i]*b[i];
  }
  printf("Parallel a.b = %d\n", dotabparallel);
  free(a);
  free(b);
  return 0;
}
This is both more succinct, and potentially better optimised. For example, the compiler might implement a tree reduction (rather than the linear reduction I coded above).
Where can you use reduction clauses? #
You don’t need to annotate a loop with a reduction. Suppose that you have a parallel region where the threads each do a single expensive operation and then combine the results. A reduction is entirely appropriate here too.
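For example, in the sketch below (expensive_work is a hypothetical stand-in for whatever each thread computes, not a function from the course code), the reduction clause sits on the parallel directive itself:

#include <omp.h>
#include <stdio.h>

/* Hypothetical stand-in for an expensive thread-specific computation. */
double expensive_work(int tid)
{
  return (double)(tid + 1);
}

int main(void)
{
  double result = 0;
  #pragma omp parallel default(none) reduction(+:result)
  {
    /* Each thread contributes once; OpenMP privatises result and
       combines the per-thread copies at the end of the region. */
    result += expensive_work(omp_get_thread_num());
  }
  printf("result = %g\n", result);
  return 0;
}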
Inter-thread synchronisation #
Sometimes, we’ll want to do something other than just compute a reduction. In these cases, if we want to pass information between threads, we need to synchronise on reading and writing to shared memory.
Barriers #
OpenMP has a number of constructs for this purpose. We’ve already
implicitly seen one of them, namely barriers. In a parallel region,
#pragma omp barrier
can be
used to synchronise all threads in the team.
#pragma omp parallel default(none) shared(a, b)
{
  int tid = omp_get_thread_num();
  int nthread = omp_get_num_threads();
  /* Produce some thread-specific data */
  a[tid] = some_computation(tid, ...);
  #pragma omp barrier
  /* Read data from a neighbouring thread */
  b[tid] = a[(tid+1)%nthread] + ...;
}
Without the barrier, there is no synchronisation between the writes to
a
and the reads when updating b
. We would therefore likely get the
wrong answer.
The barrier ensures that no thread attempts the update to b
before
all threads have arrived at the barrier (after updating a
).
Either all threads must encounter the barrier, or none; otherwise we get a deadlock.
#include <omp.h>
#include <stdio.h>

int main(void)
{
  const int N = 16;
  int a[N];
  for (int i = 0; i < N; i++) {
    a[i] = -1;
  }
  #pragma omp parallel default(none) shared(a, N)
  {
    int tid = omp_get_thread_num();
    int nthread = omp_get_num_threads();
    a[tid] = tid + nthread;
    if (tid % 2 == 0) {
      /* deadlock! */
      #pragma omp barrier
    }
    a[tid] = a[(tid + 1)%nthread];
  }
  for (int i = 0; i < N; i++) {
    printf("a[%2d] = %2d\n", i, a[i]);
  }
  return 0;
}
Exercise
Try this deadlock situation out with the above code.
Does it run successfully with one thread? What about two or three?
Hint
To terminate the hanging program, type Control-c at the command line.
Solution
It should work fine with one thread, but not more than one.
Recall that often barriers are implicit in worksharing constructs. So
if we are in a parallel region we do not need a barrier between two
#pragma omp for
directives for synchronisation (because #pragma omp for
has an implicit barrier at the end of the loop).
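For example, in the sketch below (assuming shared arrays a and b of length N, as in the earlier examples), the implicit barrier at the end of the first loop guarantees that every entry of a is written before any thread reads it in the second loop:

#pragma omp parallel default(none) shared(a, b, N)
{
  #pragma omp for schedule(static)
  for (int i = 0; i < N; i++) {
    a[i] = 2*i;
  }
  /* Implicit barrier here: all of a has been written. */
  #pragma omp for schedule(static)
  for (int i = 0; i < N; i++) {
    b[i] = a[(i + 1) % N];
  }
}

Adding nowait to the first for directive would remove that implicit barrier and reintroduce a race.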
Critical sections and atomics #
Sometimes a barrier synchronisation is a little heavy-handed. For this, OpenMP provides two further constructs: one for protecting code, and one for protecting variables.
The first is the critical section, #pragma omp critical. This directive specifies that a region of code must be executed by only one thread at a time.
Here’s a trivial example, counting the number of threads in a parallel region (don’t do this, use omp_get_num_threads()).
int nthread = 0;
#pragma omp parallel default(none) shared(nthread)
{
  #pragma omp critical
  {
    nthread += 1;
  }
}
Critical sections are more useful if you’re parallelising over a shared data structure.
For example, consider a shared task stack of work from which we can pop work and push work. In pseudo-code, this looks approximately like the below.
stack = ...; /* Create initial work */
while (1) {
  task = pop(stack);      /* Get the next work item */
  if (task == NULL) {
    break;                /* No work left, exit loop */
  }
  newtask = work(task);   /* Do work, potentially producing a new task */
  if (newtask != NULL) {
    push(newtask, stack); /* If there's a new task, add it to the stack */
  }
}
We can parallelise this with OpenMP, but we need to make sure there
are no race conditions when pushing and popping from the shared
stack
data structure. This can be achieved with critical
sections
stack = ...;
#pragma omp parallel default(none) shared(stack)
{
  while (1) {
    #pragma omp critical (modifystack)
    {
      task = pop(stack);
    }
    if (task == NULL) {
      break;
    }
    newtask = work(task);
    if (newtask != NULL) {
      #pragma omp critical (modifystack)
      push(newtask, stack);
    }
  }
}
Here we protect the modification of the stack by critical sections.
Points to note:
- I gave the critical sections an optional name (modifystack). All critical sections with the same name synchronise. If no name is provided, this matches any other critical section without a name.
- We need the critical sections to have the same name for the push and pop because both of these sections modify the same shared data structure.
- This probably isn’t a very good implementation because threads might exit the loop too early (if they pop from an empty stack before new work is pushed by another thread).
Design of high-performance data structures for this kind of irregular computation is actually an ongoing area of research. If you’re interested, look at some of the work that the Galois team are doing.
Exercise
Modify the reduction-hand.c example to use a critical section to ensure the result is always correct.
Solution
This was actually the topic of the synchronisation exercise, so see the solutions there.
Atomics #
Even finer-grained than critical sections are #pragma omp atomic
directives. These can be used to protect (some) updates to shared
variables.
int nthread = 0;
#pragma omp parallel default(none) shared(nthread)
{
  #pragma omp atomic
  nthread += 1;
}
An atomic directive protects variables (not code, like critical sections). In particular, it protects the write to the variable on the left hand side of an assignment. The allowed form is one of
- x op= expr, where op is one of +, -, *, /, &, ^, <<, or >>
- x++, ++x, x--, or --x
Note that the evaluation of expr in the first form is not protected; it is only the write to x that is protected.
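For example, in the (hypothetical) update

#pragma omp atomic
count += expensive(i);

only the addition to count is atomic; the call to expensive(i) can still run concurrently on many threads.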
Exercise
We can use these synchronisation constructs to implement different approaches to the reduction example. The openmp exercise on reductions does this.
Tools for detecting data races #
We need to be careful when writing parallel code that we do not accidentally introduce race conditions that produce incorrect results. There are some tools available to help with this. On Linux-based systems you can use helgrind.
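For example, assuming an OpenMP executable called race (the name produced by the compilation command in the exercise below), you can run it under helgrind with

$ OMP_NUM_THREADS=2 valgrind --tool=helgrind ./race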
Modern versions of GCC and Clang also offer a thread sanitizer mode, enabled with -fsanitize=thread. Again, this is transparently supported on Linux, but seemingly not on macOS.
Often, thinking hard is your best bet.
Sometimes these tools will report false positives, and you need to work a little bit to eliminate them. See this nice article for more information.
Exercise
Try compiling and running the reduction-race.c example using GCC with thread sanitizer enabled. Run with two threads; does it help you to understand the race condition?
On Hamilton load the gcc/9.3.0 module, on COSMA load the gnu_comp/10.2.0 module. You should then compile with
$ gcc -fopenmp -g -fsanitize=thread -o race reduction-race.c
The -g adds debug symbols so that we see line numbers in the error reports.
Solution
If I do this and then run with two threads, I see output like the following:
WARNING: ThreadSanitizer: data race (pid=8685)
  Read of size 4 at 0x7ffcc8e32e54 by thread T1:
    #0 main._omp_fn.0 /ddn/home/vtdb72/phys52015/code/openmp-snippets/reduction-race.c:27 (foo+0x400d35)
    #1 gomp_thread_start ../../../gcc-9.3.0/libgomp/team.c:123 (libgomp.so.1+0x19ec5)

  Previous write of size 4 at 0x7ffcc8e32e54 by main thread:
    #0 main._omp_fn.0 /ddn/home/vtdb72/phys52015/code/openmp-snippets/reduction-race.c:27 (foo+0x400d4f)
    #1 GOMP_parallel ../../../gcc-9.3.0/libgomp/parallel.c:171 (libgomp.so.1+0x10fc1)
    #2 __libc_start_main <null> (libc.so.6+0x22504)

  Location is stack of main thread.

  Location is global ‘<null>’ at 0x000000000000 ([stack]+0x00000001fe54)

  Thread T1 (tid=8687, running) created by main thread at:
    #0 pthread_create ../../../../gcc-9.3.0/libsanitizer/tsan/tsan_interceptors.cc:964 (libtsan.so.0+0x2cd6b)
    #1 gomp_team_start ../../../gcc-9.3.0/libgomp/team.c:836 (libgomp.so.1+0x1a4e5)
    #2 __libc_start_main <null> (libc.so.6+0x22504)
This tells me that multiple threads had a write-race on line 27, which is where the
dotabparallel
variable is updated.
Summary #
As well as straightforward loop parallelisation, OpenMP also provides
a number of constructs to help with accumulation of results and
synchronisation when updating shared variables. The biggest hammer is
a barrier
, but these can often be avoided in favour of more
fine-grained directives. The most useful is probably the reduction
clause for loops.