In this post, we will discuss two alternative proposals to simplify the MPIX_COMM_FAILURES_ACK and MPIX_COMM_FAILURES_GET_ACKED functions in the User Level Failure Mitigation (ULFM) MPI specification draft.
These functions are particularly useful for users whose applications need to introspect process failures and resume point-to-point communication independently from other MPI ranks. As such, they are instrumental in reducing the cost of recovery in applications with mostly decoupled communication patterns, such as master-worker codes, partial differential equation solvers, asynchronous methods, etc.
Dealing with wildcard receptions
These two functions serve two complementary purposes in the ULFM specification. The first is to identify failed processes independently from other MPI ranks; the second is to restore the ability to post receptions from MPI_ANY_SOURCE.
Consider the code in Example 1. If any process has failed, the reception must be interrupted as it otherwise risks deadlocking if the actual sender has crashed.
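Example 1 itself is not reproduced in this excerpt; a minimal sketch of the pattern it describes might look like the fragment below. The variable names and the surrounding setup are illustrative assumptions, and the error-handling comment reflects ULFM semantics only when the communicator's error handler has been set to MPI_ERRORS_RETURN.

```c
/* Illustrative sketch (not the original Example 1): a master posts a
 * wildcard reception because any worker may answer. If a potential
 * sender has crashed, ULFM interrupts the reception with a
 * process-failure error instead of letting it deadlock forever. */
int buf;
MPI_Status status;
int rc = MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, /*tag=*/0,
                  comm, &status);
if (rc != MPI_SUCCESS) {
    /* A failure was reported: the reception did not complete, and
     * wildcard receptions stay disabled until failures are
     * explicitly acknowledged. */
}
```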
The crux of the issue now becomes how to restore the capability of using wildcard receptions. In single-threaded programs, several simple options would be available. For example, the implementation could report a process failure only once and then automatically re-enable wildcard receptions. This approach, however, becomes rife with race conditions in MPI codes that perform wildcard receptions from multiple threads. In order to give control to users, re-enabling wildcard receptions must be performed by an explicit call, which can be adequately protected with mutual exclusion between threads when required.
Example 2 illustrates how to employ the MPIX_COMM_FAILURES_ACK function to explicitly re-enable the reception of ANY_SOURCE messages. After that call, operations that would have returned an error because of the potential failure of a sender process resume. The list of processes whose failures have been acknowledged can then be obtained (without side effects, possibly from another thread) by using MPIX_COMM_FAILURES_GET_ACKED, which produces the group of processes that had failed at the time of the ACK call.
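Since Example 2 is not reproduced in this excerpt either, here is a hedged sketch of the current interface in its C binding. The local variable names are illustrative; the calls follow the existing ULFM draft signatures.

```c
/* Illustrative sketch of the current ULFM interface.
 * Acknowledge all failures detected so far on comm; this re-enables
 * MPI_ANY_SOURCE receptions. */
MPIX_Comm_failure_ack(comm);

/* Later (possibly from another thread), retrieve the group of
 * processes whose failure was acknowledged by the ack above. */
MPI_Group failed;
int nfailed;
MPIX_Comm_failure_get_acked(comm, &failed);
MPI_Group_size(failed, &nfailed);
MPI_Group_free(&failed);
```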
What we like, what we don’t like
The good
The ACK/GET_ACKED pair achieves the goal of controlling the re-enabling of any-source receptions, even in multithreaded programs, with sane semantics that let user code reason and synchronize around it. Even better, one can resume any-source receptions without ever producing the group of failed processes, or mandating the use of data structures linear in the number of ranks in each communicator.
The bad
However, the interface can be confusing for unsuspecting users.
First, MPIX_COMM_FAILURES_ACK mutates the state in a way that is not known a priori. When one calls that function, it has the effect of acknowledging all failures detected thus far (as seen from the implementation's perspective). A user can know what has been acknowledged only after the fact, by using the sister call MPIX_COMM_FAILURES_GET_ACKED.
Second, it is not possible to obtain the list of failed processes without side effects (i.e., acknowledging failures) on the considered communicator, which is undesirable in some common usage patterns that users of ULFM have deployed.
The proposed evolution, with two different twists
We want to add two new abilities, while at the same time making the interface more intuitive (i.e., reading a function's name should be enough to understand what it does).
Reading the group of failed processes without side effects
The first addition we want to make is a way to peek at the list of failed processes without changing what is acknowledged (or not). This can be achieved in a simple enough way by adding the following function:
MPIX_Comm_get_failed(comm, &fgrp)
    MPI_Comm comm;
    MPI_Group *fgrp;
- This function obtains the current (local, non-synchronized) list of known failed processes.
- Repeated calls to the function return a monotonically increasing group (i.e., the group produced by an earlier call is always a prefix of the group produced during the current call).
- Two calls may produce different groups even if no MPI errors have been reported between the two calls.
- If a communication has raised a Process Failure error, the process that caused the error is included in the group produced by any later call to MPIX_Comm_get_failed.
So, this is simple enough. If a process needs to know which processes are known to have failed thus far, without incurring the cost of synchronizing that knowledge with other ranks, that's what this function provides. Note that this means another rank calling the function "at the same time" may get a different group (if that's not your use case, MPIX_Comm_agree or MPIX_Comm_shrink do synchronize fault knowledge).
You also get to peek at everything that the MPI implementation knows. That is, using this function, you may know that a process is dead before any error has been raised in a communication operation.
The groups produced by a sequence of calls are monotonic, that is, the result of any previous call to MPIX_Comm_get_failed is a prefix of the result of the current call. This will be important when we get to the next topic: controlling what we acknowledge.
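The peeking behavior described above can be sketched as follows. This is a fragment against the proposed API, which does not exist in current MPI implementations; variable names are illustrative.

```c
/* Sketch: peek at locally-known failures without acknowledging
 * anything (no side effects on comm). */
MPI_Group fgrp;
int nfailed;
MPIX_Comm_get_failed(comm, &fgrp);
MPI_Group_size(fgrp, &nfailed);
/* nfailed processes are locally known to have failed. Another rank
 * may observe a different group at the same time; a later call on
 * this rank returns a group of which fgrp is a prefix. */
MPI_Group_free(&fgrp);
```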
Acknowledging only what we already know about, and knowing what we have acknowledged
Now that we can produce a group of failed processes, let's use that knowledge to control how we acknowledge failed processes (and thus control the error behavior of any-source receptions). Both variants also rely on the previously described MPIX_Comm_get_failed.
Variant 1: counting and an acknowledged group
MPIX_Comm_failure_ack(comm, nack)
    MPI_Comm comm;
    int nack;

MPIX_Comm_failure_get_acked(comm, &agrp)
    MPI_Comm comm;
    MPI_Group *agrp;
The first function, MPIX_Comm_failure_ack, lets the user code acknowledge failures (and re-enable any-source receptions), but only for the given maximum number of processes nack. That is, if a code looks at fgrp and is satisfied that this is a correctable condition and any-source receptions shall resume, passing the count obtained from MPI_Group_size(fgrp, &nack) prevents acknowledging not-yet-considered failures that may have been detected in the meantime. In other words, we now have a way to know in advance what we are acknowledging.
Note that if nack is smaller than the number of processes acknowledged in a previous call, the call is a no-op. Conversely, if nack is larger than nfailed, the size of the fgrp that would be produced by a concurrent MPIX_Comm_get_failed call, then only nfailed processes are acknowledged.
The second function, MPIX_Comm_failure_get_acked, obtains the group of currently acknowledged failed processes.
In simple cases (for example, when nack <= nfailed), one can simply reason about how many members of fgrp have been acknowledged. However, if nack > nfailed, or when multiple software modules or threads issue MPIX_Comm_failure_ack, one may want to see what is currently acknowledged without having to rely on shared variables.
Thus, MPIX_Comm_failure_get_acked produces agrp, the group of currently acknowledged failed processes. If no calls to MPIX_Comm_failure_ack are issued on comm (possibly from another software module or thread), the function keeps producing identical groups.
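Putting Variant 1 together, a typical sequence might look like the sketch below. Again, this is the proposed API only, and the variable names are illustrative assumptions.

```c
/* Sketch of Variant 1: acknowledge exactly the failures we have
 * examined, and no more. */
MPI_Group fgrp, agrp;
int nack, nacked;

MPIX_Comm_get_failed(comm, &fgrp);   /* peek, no side effects */
MPI_Group_size(fgrp, &nack);
/* ... inspect fgrp, decide the condition is correctable ... */

/* Acknowledge at most nack failures: failures detected after the
 * get_failed call above are NOT silently acknowledged. */
MPIX_Comm_failure_ack(comm, nack);

/* Possibly from another thread or module: what is acknowledged now? */
MPIX_Comm_failure_get_acked(comm, &agrp);
MPI_Group_size(agrp, &nacked);

MPI_Group_free(&fgrp);
MPI_Group_free(&agrp);
```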
Variant 2: counting only
The first interface variant achieves almost all of the goals we have set. The only limitation is that one cannot simply count how many processes have been acknowledged without producing the group agrp. In some use cases (e.g., Monte Carlo methods), the only important criterion is how many workers are left, not which workers are still operating, hence producing agrp can be seen as pure overhead.
A solution to that problem can be achieved by changing the acknowledging function so that it returns how many members of fgrp have actually been acknowledged by the call.
MPIX_Comm_failure_ack_failed(comm, nack, &nacked)
    MPI_Comm comm;
    int nack;
    int *nacked;
This function does the same as the Variant 1 call above, but in addition it produces in nacked the size of the agrp group that would be produced by a later call to MPIX_Comm_failure_get_acked. This permits counting the number of acknowledged (and thus failed) processes without ever building fgrp or agrp.