Categories
Research ULFM User Level Failure Mitigation

Simplifying the ACK/GET_ACKED couple

In this post, we will discuss two alternative proposals to simplify the MPIX_COMM_FAILURES_ACK and MPIX_COMM_FAILURES_GET_ACKED in the User Level Failure Mitigation MPI specification draft.

These functions are particularly useful for users whose application need to introspect about process failures and resume point-to-point communication capabilities independently from other MPI ranks. As such, they are instrumental in reducing the cost of recovery in applications with mostly decoupled communication patterns, such as master-worker, partial differential equation solvers, asynchronous methods, etc.

Dealing with wildcard receptions

These two functions serve two complimentary purposes in the ULFM specification. The first purpose is to identify failed processes independently from other MPI ranks, while the second is to restore the ability to post receptions from MPI_ANY_SOURCE.

Consider the code in Example 1. If any process has failed, the reception must be interrupted as it otherwise risks deadlocking if the actual sender has crashed.

Any-source reception must be interrupted to let the user code avert the deadlock risk if the expected sender process has died.

The crux of the issue now becomes how to restore the capability of using wildcard receptions again. In single threaded programs, multiple simple options would be available. For example, the implementation may report a process failure only once and then automatically re-enable wildcard receptions. This approach however become rife with race conditions in MPI codes that perform wildcard receptions from multiple threads. In order to give control to users, re-enabling wildcard receptions must be performed by an explicit call, that can be adequately protected with mutual exclusion between threads, when required.

Example 2 illustrates how to employ the MPIX_COMM_FAILURES_ACK function to explicitly re-enable the reception of ANY_SOURCE messages. After that call, operations that would return an error because of the potential failure of a sender process resume. The list of such processes for which failures are acknowledged can then be obtained (without side effects, possibly from another thread) by using MPIX_COMM_FAILURES_GET_ACKED to produce the group of failed processes at the time of the ACK call.

A combination of ACK/GET_ACKED lets resume any-source receptions.

What we like, what we don’t like

The good

The ACK/GET_ACKED pair achieves the goal of controlling re-enabling any-source receptions even in multithreaded programs with a sane meaning that enable user-code to reason and synchronize around it. Even better, one can resume any-source receptions without ever producing the group of failed processes, or mandating the use of data structures linear in the number of ranks in each communicators.

The Bad

However, the interface can be confusing for unsuspecting users.

First, the MPIX_COMM_FAILURES_ACK mutate the state in a way that is not a priori known. When one calls that function, it has the effect of acknowledging all thus far detected failures (as seen from the implementation perspective). An user can know what has been acknowledged only after the fact, by using the sister call MPIX_COMM_FAILURES_GET_ACKED.

Second, it is not possible to obtain the list of failed processes without side effects (i.e., acknowledging failures) on the considered communicator, which is undesirable in some common usage patterns that have been seen deployed by users of ULFM.

The proposed evolution, with two different twists

We want to add two new abilities, while at the same time making the interface more intuitive (i.e., reading the function name should be enough to understand what it does)

Reading the group of failed processes without side effects

The first addition we want to make is a way to peek at the list of failed processes without changing what is acknowledged (or not). This can be achieved in a simple enough way by adding the following function:

MPIX_Comm_get_failed(comm, &fgrp)
MPI_Comm comm;
MPI_Group* fgrp;
  • This function obtains the current (local, non-synchronized) list of known failed processes.
  • Repeated calls to the function return a monotonically increasing group (i.e., the group produced by an earlier call is always a prefix of the group produced during the current call).
  • Two calls may produce different groups even if no MPI errors have been reported between the two calls.
  • If a communication has raised a Process Failure error, the process that caused the error to be raised is in the produced group in a later call to MPIX_Comm_get_failed.

So, this is simple enough. If a process needs to know what processes are known to have failed thus far, without incurring the cost of synchronizing that knowledge with other ranks, that’s what you get from this function. Note that it means that another rank calling that function “at the same time” may get a different group (if that’s not your use case, MPIX_Comm_agree, or MPIX_Comm_skrink do synchronize fault knowledge).

You also get to peek at everything that the MPI implementation knows. That is, using this function, you may know that a process is dead before any error has been raised in a communication operation.

The groups produced by a sequence of calls are monotonic, that is, the result of any previous call to MPIX_Comm_get_failed is a prefix to the current call. This will be important when we go to the next topic: controlling what we acknowledge.

Acknowledging only what we already know about, and knowing what we have acknowledged.

Know that we can produce a group of failed processes, lets use that knowledge to control how we acknowledge failed processes (and thus control any-source reception error behavior). In both variants we will also rely on previously described MPIX_Comm_get_failed.

Variant1: counting and acknowledged group

MPIX_Comm_failure_ack(comm, nack) MPI_Comm comm; int nack;

MPIX_Comm_failure_get_acked(comm, &agrp) MPI_Comm comm; MPI_Group* fgrp;

The first function MPIX_Comm_failure_ack lets the user code aknowledge (an re-enable any-source), but for the given maximum number of processes nack. That is, if a code looks at the fgrp and is satisfied that this is a correctable condition and any-source shall resume, MPI_GROUP_SIZE(fgrp, &nack) prevents from acknowledging not-yet considered failures that may have been detected in the meantime. That is, we now have a way to know in advance what we are acknowledging.

Note that if nack is smaller than what has been acknowledged in a previous call, the call is a no-op. Conversely, if nack is larger than nfailed, the size of the fgrp that would be produced by a concurrent MPIX_Comm_get_failed call, then only nfailed processes are acknowledged.

The second function MPIX_comm_failure_get_acked obtains the group of currently acknowledged faulty processes.

In simple cases (for example, when nack <= nfailed), one can simply reason about how many members of fgrp have been acknowledged. However, if nack > nfailed, or when multiple software modules or threads are issuing MPIX_Comm_failure_ack, one may want to see what is currently acknowledged without having to rely on shared variables.

Thus, MPIX_comm_failure_get_acked produces the group agrp, the group of currently acknowledged failed processes. If no calls to MPIX_Comm_failure_ack are issued on comm (possibly from another software module or thread), the function keeps producing similar groups.

Variant2: Counting only

The first interface variant achieves almost all of the goals we have set. The only limitation is that one cannot just count how many processes have been acknowledged without producing the group agrp. In some use cases (e.g., monte carlo), the only important criterion is how many workers are left, not what workers are still operating, hence producing agrp can be seen as pure overhead.

A solution to that problem can be achieved by changing the acknowledging function so that it returns how many members in fgrp have actually been acknowledged by the call.

MPIX_Comm_failure_ack_failed(comm, nack, &nacked)
MPI_Comm comm;
int nack;
int* nacked;

This function does the same as the variant 1 above, but in addition it produces in nacked the size of the agrp group produced by a later call to MPIX_Comm_faillure_get_acked. This permits counting the number of acknowledged (and thus failed) processes without ever building fgrp or agrp.

More Example use cases

Code does not do any-source, just looking at the group of failed processes without incurring side effects.
Resuming any-source unconditionally, a common usage pattern in Monte-Carlo codes. Counter based (variant2) can also count the number of failed processes for free.
In the counting variant (variant2), one can obtain the acked group without using the additional function MPIX_Comm_failure_get_acked (useful if MPIX_Comm_get_failed has been called before).
Comparison between variant1, variant2, and old ACK/GET_ACKED.