In this post, we will discuss two alternative proposals to simplify the
MPIX_COMM_FAILURES_GET_ACKED in the User Level Failure Mitigation MPI specification draft.
These functions are particularly useful for users whose application need to introspect about process failures and resume point-to-point communication capabilities independently from other MPI ranks. As such, they are instrumental in reducing the cost of recovery in applications with mostly decoupled communication patterns, such as master-worker, partial differential equation solvers, asynchronous methods, etc.
Dealing with wildcard receptions
These two functions serve two complimentary purposes in the ULFM specification. The first purpose is to identify failed processes independently from other MPI ranks, while the second is to restore the ability to post receptions from
Consider the code in Example 1. If any process has failed, the reception must be interrupted as it otherwise risks deadlocking if the actual sender has crashed.
The crux of the issue now becomes how to restore the capability of using wildcard receptions again. In single threaded programs, multiple simple options would be available. For example, the implementation may report a process failure only once and then automatically re-enable wildcard receptions. This approach however become rife with race conditions in MPI codes that perform wildcard receptions from multiple threads. In order to give control to users, re-enabling wildcard receptions must be performed by an explicit call, that can be adequately protected with mutual exclusion between threads, when required.
Example 2 illustrates how to employ the
MPIX_COMM_FAILURES_ACK function to explicitly re-enable the reception of
ANY_SOURCE messages. After that call, operations that would return an error because of the potential failure of a sender process resume. The list of such processes for which failures are acknowledged can then be obtained (without side effects, possibly from another thread) by using
MPIX_COMM_FAILURES_GET_ACKED to produce the group of failed processes at the time of the ACK call.
What we like, what we don’t like
The ACK/GET_ACKED pair achieves the goal of controlling re-enabling any-source receptions even in multithreaded programs with a sane meaning that enable user-code to reason and synchronize around it. Even better, one can resume any-source receptions without ever producing the group of failed processes, or mandating the use of data structures linear in the number of ranks in each communicators.
However, the interface can be confusing for unsuspecting users.
MPIX_COMM_FAILURES_ACK mutate the state in a way that is not a priori known. When one calls that function, it has the effect of acknowledging all thus far detected failures (as seen from the implementation perspective). An user can know what has been acknowledged only after the fact, by using the sister call
Second, it is not possible to obtain the list of failed processes without side effects (i.e., acknowledging failures) on the considered communicator, which is undesirable in some common usage patterns that have been seen deployed by users of ULFM.
The proposed evolution, with two different twists
We want to add two new abilities, while at the same time making the interface more intuitive (i.e., reading the function name should be enough to understand what it does)
Reading the group of failed processes without side effects
The first addition we want to make is a way to peek at the list of failed processes without changing what is acknowledged (or not). This can be achieved in a simple enough way by adding the following function:
MPIX_Comm_get_failed(comm, &fgrp) MPI_Comm comm; MPI_Group* fgrp;
- This function obtains the current (local, non-synchronized) list of known failed processes.
- Repeated calls to the function return a monotonically increasing group (i.e., the group produced by an earlier call is always a prefix of the group produced during the current call).
- Two calls may produce different groups even if no MPI errors have been reported between the two calls.
- If a communication has raised a Process Failure error, the process that caused the error to be raised is in the produced group in a later call to MPIX_Comm_get_failed.
So, this is simple enough. If a process needs to know what processes are known to have failed thus far, without incurring the cost of synchronizing that knowledge with other ranks, that’s what you get from this function. Note that it means that another rank calling that function “at the same time” may get a different group (if that’s not your use case,
MPIX_Comm_skrink do synchronize fault knowledge).
You also get to peek at everything that the MPI implementation knows. That is, using this function, you may know that a process is dead before any error has been raised in a communication operation.
The groups produced by a sequence of calls are monotonic, that is, the result of any previous call to
MPIX_Comm_get_failed is a prefix to the current call. This will be important when we go to the next topic: controlling what we acknowledge.
Acknowledging only what we already know about, and knowing what we have acknowledged.
Know that we can produce a group of failed processes, lets use that knowledge to control how we acknowledge failed processes (and thus control any-source reception error behavior). In both variants we will also rely on previously described
Variant1: counting and acknowledged group
MPIX_Comm_failure_ack(comm, nack) MPI_Comm comm; int nack; MPIX_Comm_failure_get_acked(comm, &agrp) MPI_Comm comm; MPI_Group* fgrp;
The first function
MPIX_Comm_failure_ack lets the user code aknowledge (an re-enable any-source), but for the given maximum number of processes
nack. That is, if a code looks at the
fgrp and is satisfied that this is a correctable condition and any-source shall resume,
MPI_GROUP_SIZE(fgrp, &nack) prevents from acknowledging not-yet considered failures that may have been detected in the meantime. That is, we now have a way to know in advance what we are acknowledging.
Note that if
nack is smaller than what has been acknowledged in a previous call, the call is a no-op. Conversely, if
nack is larger than
nfailed, the size of the
fgrp that would be produced by a concurrent
MPIX_Comm_get_failed call, then only
nfailed processes are acknowledged.
The second function
MPIX_comm_failure_get_acked obtains the group of currently acknowledged faulty processes.
In simple cases (for example, when
nack <= nfailed), one can simply reason about how many members of
fgrp have been acknowledged. However, if
nack > nfailed, or when multiple software modules or threads are issuing
MPIX_Comm_failure_ack, one may want to see what is currently acknowledged without having to rely on shared variables.
MPIX_comm_failure_get_acked produces the group
agrp, the group of currently acknowledged failed processes. If no calls to
MPIX_Comm_failure_ack are issued on
comm (possibly from another software module or thread), the function keeps producing similar groups.
Variant2: Counting only
The first interface variant achieves almost all of the goals we have set. The only limitation is that one cannot just count how many processes have been acknowledged without producing the group
agrp. In some use cases (e.g., monte carlo), the only important criterion is how many workers are left, not what workers are still operating, hence producing
agrp can be seen as pure overhead.
A solution to that problem can be achieved by changing the acknowledging function so that it returns how many members in
fgrp have actually been acknowledged by the call.
MPIX_Comm_failure_ack_failed(comm, nack, &nacked) MPI_Comm comm; int nack; int* nacked;
This function does the same as the variant 1 above, but in addition it produces in
nacked the size of the
agrp group produced by a later call to
MPIX_Comm_faillure_get_acked. This permits counting the number of acknowledged (and thus failed) processes without ever building