Non-blocking messages #
As well as the blocking point-to-point messaging we saw last time, MPI also offers non-blocking versions.
These functions all return immediately and provide a “request” object, which we can later either wait on for completion or test to check whether the message has been sent/received.
The function signatures for MPI_Isend and MPI_Irecv are:
int MPI_Isend(const void *buffer, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *request);
int MPI_Irecv(void *buffer, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Request *request);
The mpi4py versions are:
request = comm.Isend([buffer, count, datatype], dest, tag)
request = comm.Irecv([buffer, count, datatype], src, tag)
So both sends and receives now return a request object.
With the blocking versions (MPI_Send, MPI_Ssend, MPI_Bsend), the buffer argument is safe to reuse as soon as the function returns. Equally, as soon as MPI_Recv returns, we know the message has been received and we can inspect the contents. This is not the case for non-blocking calls.
We are not allowed to reuse the buffer (or rely on its contents being ready) until we have “waited” on the request handle. See below for details on how to do this.
If we have a request, we can check whether the message it corresponds to has been completed with MPI_Test:
request.Test()
=> True # if the request is completed (message is sent/received)
=> False # otherwise.
The return value will be true if the provided request has been completed, and false otherwise.
If instead we want to wait for completion, we can use MPI_Wait
request.Wait(status=None)
which blocks until the message corresponding to request has been completed.
Both of these calls can complete the message exchange. If MPI_Test returns True, the message has been sent/received and the user-provided send/receive buffer is safe to be used again.
Here’s a picture of a non-blocking MPI_Issend matching with a blocking MPI_Recv. Note how the data transfer does not start (because this is a synchronous send) until the matching receive has been posted (set up). So the first MPI_Test returns false. The MPI_Wait will return immediately because the message has now been transferred.
Why would you do this? #
Non-blocking messages allow us to separate “posting” messages from when we check if they are completed. One reason to do this is that MPI libraries often have optimisations to complete sends quickly if the matching receive already exists.
Suppose I am receiving messages from 10 different processes. If I use a blocking MPI_Recv, then there is only ever one receive ready at any one time. Conversely, if I use MPI_Irecv, then all receives will be ready, and the MPI library can complete them as the matching sends arrive.
It also allows us to simplify programs that exchange many messages if we’re trying to avoid deadlocks. We can just post all sends/receives at once and then wait, rather than having to arrange that we have a single send/receive ready at the right time.
Finally, non-blocking communication allows us to (in theory) overlap communication with computation. This can help to improve scaling performance in some cases.
As you probably saw when doing the ping-pong exercise, all MPI messages have a non-zero latency. That means that no matter how small a message is, it takes some time to cross the network. If we use blocking messages, the best case total time for our simulation is going to be
$$ T_{\text{compute}} + T_{\text{communicate}} $$
Many scientific computing simulations have compute and communication parts that can overlap. For example, when domain decomposing a mesh for a parallel PDE solver, most of the computation can be done without communicating with our neighbours: we only need information when we’re near the edge of our local domain. We can therefore often split the simulation into phases:
1. Send data to neighbours
2. Compute on local data that doesn’t depend on neighbours
3. Receive data from neighbours
4. Compute on remaining local data
If we use non-blocking messages, we can sometimes hide the latency in steps (1) and (3), so that the total simulation time can, in the best case, be as low as
$$ \max(T_{\text{compute}}, T_{\text{communicate}}) < T_{\text{compute}} + T_{\text{communicate}} $$
The implementation of halo exchanges in the Grid data structure will use this facility.
Waiting for multiple messages #
The advantage of the non-blocking communication mode becomes more apparent when we look at waiting or testing for completion of multiple messages simultaneously.
Typical pseudo-code with non-blocking communication might look something like this:
requests = []
for _ in range(nrecv):
    requests.append(comm.Irecv(...))
for _ in range(nsend):
    requests.append(comm.Isend(...))
# Some work that doesn't depend on the messages
...
Having done the work that doesn’t depend on messages, we now need to wait for message completion.
Perhaps we need all the messages to complete, in which case we can use
MPI_Waitall
MPI.Request.Waitall(requests)
This approach is preferred over a loop calling MPI_Wait on each request, since the MPI implementation is free to process the arriving messages in any order when we call MPI_Waitall, which might speed things up.
Perhaps we just need any one of the messages to have completed, in which case we can use MPI_Waitany
which = MPI.Request.Waitany(requests)
Now the which variable tells us the index of the request that completed.
Finally, if we want to wait until at least one message has completed, we can use MPI_Waitsome
indices = MPI.Request.Waitsome(requests)
# Now len(indices) tells us how many requests are completed,
# and indices[0..len(indices)-1] tells us which requests they are
There are also matching MPI_Testall, MPI_Testany, and MPI_Testsome calls which don’t block for completion of the messages.
A high quality MPI implementation will provide optimised code for these routines that is more efficient than a loop with MPI_Test/MPI_Wait pairs.
Exercise
Gathering data from every process
Write an MPI code in which rank-0 gathers a message from every process and places it in an array at a position corresponding to the rank of the sender.
So if running with $P$ processes, rank-0 should allocate an array with space for $P$ entries and, after collecting the messages, entry $i$ should hold the message sent by rank $i$.
Compare the performance of two versions.
- rank-0 uses a blocking MPI_Recv for all receives
- rank-0 uses non-blocking MPI_Irecv followed by MPI_Waitall.

Which performs better as a function of the total number of messages, $P$?
Wildcard matching #
So far, we’ve always given specific source and tag arguments to MPI_Recv and MPI_Irecv. MPI also provides us with the option to say “receive a message, I don’t care who it’s from, or what the tag is”.
We do that by providing MPI_ANY_SOURCE and/or MPI_ANY_TAG as the source and tag arguments respectively.
We can subsequently find out where we got the message from, and what its tag was, by inspecting the status object that MPI_Recv fills in.
Up to now, we’ve just said MPI_STATUS_IGNORE
, but we can also do
status = MPI.Status()
comm.Recv(..., MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
status.source # The source rank
status.tag # The tag
There actually aren’t that many reasons you would use wildcards in receives. They can be useful when implementing dynamic sparse data exchange.
Typically, the implementation of “wildcard” matching is less efficient than message matching with given source and tag arguments.
Summary #
As well as providing blocking send/receive options, MPI provides non-blocking versions.
These allow us to potentially improve performance of message exchange, and simplify writing algorithms that need to match many pairs of messages, without thinking as hard about potential deadlocks.
The critical thing to recall is that we must not modify a send buffer, or read a receive buffer, that we have passed to a non-blocking call until a blocking MPI_Wait-like call has returned, or a non-blocking MPI_Test-like call has returned true.