Simplifying the ACK/GET_ACKED couple

In this post, we will discuss two alternative proposals to simplify the MPIX_COMM_FAILURES_ACK and MPIX_COMM_FAILURES_GET_ACKED in the User Level Failure Mitigation MPI specification draft. These functions are particularly useful for users whose application need to introspect about process failures and resume point-to-point communication capabilities independently from other MPI ranks. As such, they are instrumental in reducing the cost of recovery in applications with mostly decoupled communication patterns, such as master-worker, partial differential equation solvers, asynchronous methods, etc. Dealing with wildcard receptions These two functions serve two complimentary purposes in the ULFM specification. The first purpose is to identify failed processes independently from other MPI ranks, while the second is to restore the ability to post receptions from MPI_ANY_SOURCE. Consider the code in Example 1. If any process has failed, the reception must be interrupted as it otherwise risks deadlocking if the actual sender has crashed. The crux of the issue now becomes how to restore the capability of using wildcard receptions again. In single threaded programs, multiple simple options would be available. For example, the implementation may report a process failure only once and then automatically re-enable wildcard receptions. This approach however become rife with race conditions in Continue reading Simplifying the ACK/GET_ACKED couple