Some pointers to more advanced features of MPI #
Communicator manipulation #
We saw that we can distinguish point-to-point messages by providing different tags, but that there was no such facility for collective operations. Moreover, a collective operation (by definition) involves all the processes in a communicator.
This raises two questions:
1. How can we have multiple collective operations without them interfering with each other?
2. What if we want a collective operation that uses only a subset of the processes (e.g. only processes with an even rank)?
We might worry that we’re reduced to writing everything by hand using point-to-point messages, but fear not, MPI has us covered.
Duplicating communicators #
To address point 1, collective operations match based on the communicator context, and MPI allows us to duplicate communicators. This provides us with a new communicator that contains exactly the same set of processes with the same ranks, but collectives on one communicator won’t interfere with those on another (and similarly for point-to-point messages).
int MPI_Comm_dup(MPI_Comm incomm, MPI_Comm *outcomm);
This is very useful if you are writing a library that uses MPI. Whenever someone calls your library you do
MPI_Comm_dup(user_communicator, &library_communicator);
and then always use the library_communicator inside your library. Now you can guarantee that you will never accidentally match any messages or collectives that the user runs on their user_communicator.
When we are done, we should release the communicator we duplicated (so as not to leak memory) by calling MPI_Comm_free:
int MPI_Comm_free(MPI_Comm *comm);
/* To release a communicator: */
MPI_Comm_free(&library_communicator);
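A minimal sketch of this pattern might look like the following (the function names library_init and library_finalize are invented for this example; they are not part of MPI):

#include <mpi.h>

static MPI_Comm library_communicator = MPI_COMM_NULL;

void library_init(MPI_Comm user_communicator)
{
  /* All library-internal messages and collectives use this duplicate,
     so they can never accidentally match traffic on user_communicator. */
  MPI_Comm_dup(user_communicator, &library_communicator);
}

void library_finalize(void)
{
  /* Release the duplicate so we do not leak it. */
  MPI_Comm_free(&library_communicator);
}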
Splitting communicators into subgroups #
This is useful if we want some collective operation over a subset of all the processes; for example, we might want to gather along the rows of a distributed matrix. This can be done by calling MPI_Comm_split:
int MPI_Comm_split(MPI_Comm incomm, int colour, int key, MPI_Comm *newcomm);
The colour decides which ranks in incomm end up in the same newcomm: ranks that provide the same colour will be in the same group. The key can be used to provide an ordering of the ranks in the new group; usually we pass the rank in incomm.
For example, to split the processes in MPI_COMM_WORLD into even-ranked and odd-ranked groups we can use:
int rank;
MPI_Comm incomm = MPI_COMM_WORLD;
MPI_Comm newcomm;
MPI_Comm_rank(incomm, &rank);
MPI_Comm_split(incomm, rank % 2, rank, &newcomm);
/* Do stuff with newcomm */
/* Release once we are done */
MPI_Comm_free(&newcomm);
Here’s a picture of the split: the even-ranked processes end up together in one new communicator, and the odd-ranked processes in another.
We emphasise again that this does not produce new processes, it just provides a communication context that does not contain all processes.
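As an illustration (a sketch continuing the even/odd example above, not part of the course code), a reduction on newcomm only involves the processes in the same half of the split:

int sum;
/* Even-ranked processes obtain the sum of all even ranks;
   odd-ranked processes obtain the sum of all odd ranks. */
MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, newcomm);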
Exercise
code/mpi-snippets/split-comm.c contains a simple example. Have a look at the code, then compile and run it. Do you understand the output? Do you understand why there is only one splitcomm variable (despite splitting the input communicator in two)?
This splitting facility is useful if we only need a subset of all the processes to participate in a collective operation. For example, the outer-product matrix-matrix SUMMA multiplication in the coursework requires a broadcast of matrix blocks along rows and columns of the 2D process grid; this is much simplified by creating communicators for the rows and columns, as sketched below.
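Here is one way this might look (a sketch only, assuming a square p × p process grid stored row-major in the input communicator; the helper name make_grid_comms is invented for this example):

#include <mpi.h>

void make_grid_comms(MPI_Comm comm, int p, MPI_Comm *rowcomm, MPI_Comm *colcomm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  int row = rank / p;  /* grid row of this process */
  int col = rank % p;  /* grid column of this process */

  /* Processes with the same row colour share a rowcomm, ordered by column;
     processes with the same column colour share a colcomm, ordered by row. */
  MPI_Comm_split(comm, row, col, rowcomm);
  MPI_Comm_split(comm, col, row, colcomm);
}

With these in hand, broadcasting a block along its row is just an MPI_Bcast on rowcomm.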
Further features and details #
In addition to what we’ve seen, MPI provides a number of other features that are useful for writing libraries. We won’t cover them in detail, but just mention some aspects.
File IO #
MPI, via MPI-IO, provides a portable and high-performance way of reading and writing files in parallel. This forms the backbone of higher-level parallel file libraries like HDF5 and NetCDF.
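For a flavour of the interface, here is a minimal sketch (the filename ranks.dat is arbitrary) in which every process writes its rank into a shared file at a non-overlapping offset:

#include <mpi.h>

int main(int argc, char **argv)
{
  int rank;
  MPI_File fh;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* All processes open the file collectively... */
  MPI_File_open(MPI_COMM_WORLD, "ranks.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  /* ...then each writes one int at an offset determined by its rank. */
  MPI_File_write_at(fh, (MPI_Offset)(rank * sizeof(int)), &rank, 1, MPI_INT,
                    MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  MPI_Finalize();
  return 0;
}

Because the offsets do not overlap, all processes can write concurrently without any coordination beyond the collective open and close.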
Profiling interface #
All MPI functions (everything called MPI_XXX) are actually just wrappers around internal “profiling” functions whose names start with PMPI_XXX. For example, MPI_Send is implemented in the MPI library as:
int MPI_Send(const void *sendbuf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  return PMPI_Send(sendbuf, count, datatype, dest, tag, comm);
}
The public MPI_ functions are exported with weak symbol binding so we can override them. For example, suppose that we want to print a message every time an MPI_Send is called; our code could do:
#include <stdio.h>
#include <mpi.h>

int MPI_Send(const void *sendbuf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  /* Log the destination, then call the real implementation. */
  printf("Sending message to %d\n", dest);
  return PMPI_Send(sendbuf, count, datatype, dest, tag, comm);
}
This facility is used to write tools that can produce timelines of the message passing in a parallel program.
One-sided messaging and Remote Memory Access #
All of the messaging we saw was two-sided, in that we need both a send and a receive. MPI-2 introduced, and MPI-3 extended and improved, support for one-sided messages and direct access to remote (off-process) memory. For details on these features, if you’re interested, I recommend the books Using MPI and Using Advanced MPI. See also Torsten Hoefler’s tutorials.
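To give a flavour (a minimal sketch, not a recommendation on how to structure real code): rank 0 exposes an array through an MPI window, and every process deposits its rank directly into that array with MPI_Put, without rank 0 posting any receives.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size;
  int *baseptr = NULL;
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Only rank 0 exposes any memory; everyone else exposes a zero-size window. */
  if (rank == 0) baseptr = calloc(size, sizeof(int));
  MPI_Win_create(baseptr, rank == 0 ? size * sizeof(int) : 0, sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  /* Fences delimit the access epoch; in between, each process writes its
     rank into slot `rank` of rank 0's array. */
  MPI_Win_fence(0, win);
  MPI_Put(&rank, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
  MPI_Win_fence(0, win);

  if (rank == 0) {
    for (int i = 0; i < size; i++)
      printf("slot %d contains %d\n", i, baseptr[i]);
    free(baseptr);
  }
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}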
Language bindings and libraries #
Julia has MPI bindings in MPI.jl, and distributed arrays in the MPIArrays.jl package.
Python has wrappers via mpi4py. For distributed array computation, look at dask.
For parallel sparse linear algebra, and PDE solvers, PETSc and Trilinos are robust and well-supported libraries.