Uniform Intercomm Creation

A question about uniformly creating an inter-communicator using MPI_Intercomm_create has been posted on the ULFM mailing list. Initially, I though it is an easy corner-case, that can be solved with few barriers and/or agreements. It turns out this issue is more complicated that initially thought. Let me try to detail our adventure toward writing a uniform intercomm creation function.

Before moving further, let’s clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough info to start talking about:

MPI_INTERCOMM_CREATE is used to bind two intra-communicators into an inter-communicator

In other words, if you provide two intra-communicators, a leader on each one and a bridge communicator where the leaders can talk together, you will be able to bind the two groups of processes corresponding to each of the intra-communicators into a inter-communicator. Neat! Graphically speaking this should look like
MPI_Intercomm_create

So far so good, but what “uniformly” means? Based on some web resources uniformly is about consistency, in this particular instance about the fact that all participants will return the same answer (aka. the some inter-communicator or the same error code) despite all odds. Here we are looking specifically from the point of view of process failures, more precisely a single failure. Obviously, the trivial approach where one agrees on the resulting intercomm doesn’t work, as due to process failures some processes might not have an intercomm. Clearly, we need a different, better approach.

While talking about this, Aurelien proposed the following solution, entirely based on MPI_Barrier.

Looks quite simple and straightforward. Does it do reach the expected consensus at the end? Unfortunately not, it doesn’t work in all cases. Imagine that all peers in one side succeeded to create the intercomm while everyone in the remote communicator failed. Then on one side the interexists is TRUE while on the other side it is FALSE. Thus the local barrier (line 3) succeed on the side with interexists equal to TRUE and the intercomm is not revoked. As the intercomm exists, these processes will then go in the second MPI_Barrier (line 10), on the “half-created” intercomm, and will deadlock as there will never be a remote process to participate to the barrier.

What we are missing here is an agreement between the two sides (A and B) that things went well for everyone. So let’s try to use agreement to reach this goal. In the remaining of this post I will use the function AGREE, which is a point-to-point agreement between two processes (the two leaders). We need this particular function, as we cannot call MPI_Comm_agree on the bridge communicator, simply because the two leaders are not the only participants in the bridge communicator. Fortunately, implementing an agreement between two processes is almost a trivial task, so I will consider the AGREE function as existing.

If there is no failure the code works, but it’s an overkill. Now, if we have one failure during the execution there are several cases.
1. The failure happens before the MPI_Intercomm_create. Some processes will have an intercomm, and some will not. Thus the local agreement will make everyone on the same group (A and B) agree about the local existence of the intercomm. The AGREE between the leaders will then propagate this knowledge to the remote group leader. At this point the two leaders have the correct answer, and then the MPI_Comm_agree at line 8 will finally distribute this answer on the two groups. This case works!
2. The failure happen after the MPI_Intercomm_create. One can think about this case as a false positive (because the intercomm exists on all alive processes), but the real question is if we are able to detect it and reach a consensus. The MPI_Comm_agree at line 3 will make the flag TRUE on both groups. Note that on one side the MPI_Comm_agree will return MPI_ERR_PROC_FAILED as required by the MPI standard, but we do not take in account the return code here.

ANU presents PDE solver with ULFM at IPDPS

Mohsin Ali and Peter Strazdins presented their work on “Application Level Fault Recovery, Using Fault-Tolerant Open MPI in a PDE Solver”, during the IPDPS PDSEC workshop, last week. See the full slides for more details.

This novel work joins the growing list of applications benefiting from ULFM to feature fault tolerance; more examples are presented in these applications slides.

If you have worked on fault tolerant applications with ULFM, or are thinking about doing so, please contact us.

Preparing for June MPI Forum meeting

In preparation for the June MPI forum meeting, the specification has received some updates. The most prominent changes are:

  • The exposed memory in an RMA window may be completely undefined after a failure has occured.
  • MPI_Comm_agree now operates a binary AND on the flag argument.
  • Examples have been corrected to use error classes, instead of error codes, when relevant.

The latest version is available in the ULFM specification area

ULFM Specification update

A new version of the ULFM specification accounting for remarks and discussions going on at the MPI Forum Meeting in Chicago in December 2013 has been posted under the ULFM Specification item.

This new update adds a new error code to separate process failure errors from non-impacted requests when they remain pending (MPI_ERR_PROC_FAILED_PENDING), and adds new examples.

Head to ULFM Specification for more info.