Research – Fault Tolerance Research Hub

FTXS’22 Paper: Implicit Actions and Non-blocking Failure Recovery with MPI

Posted on October 28, 2022 by Aurelien Bouteiller

Our paper, Implicit Actions and Non-blocking Failure Recovery with MPI, has been accepted at the FTXS’22 workshop. This paper presents recent evolutions to the ULFM specification that enable a more asynchronous, implicit style of triggering recovery action, and enable overlap between recovering the application state and the MPI library state.

Simplifying the ACK/GET_ACKED couple

Posted on August 26, 2019 by Aurelien Bouteiller

In this post, we will discuss two alternative proposals to simplify the MPIX_COMM_FAILURES_ACK and MPIX_COMM_FAILURES_GET_ACKED in the User Level Failure Mitigation MPI specification draft. These functions are particularly useful for users whose application need to introspect about process failures and resume point-to-point communication capabilities independently from other MPI ranks. As such, they are instrumental in reducing the cost of recovery in applications with mostly decoupled communication patterns, such as master-worker, partial differential equation solvers, asynchronous methods, etc. Dealing with wildcard receptions These two functions serve two complimentary purposes in the ULFM specification. The first purpose is to identify failed processes independently from other MPI ranks, while the second is to restore the ability to post receptions from MPI_ANY_SOURCE. Consider the code in Example 1. If any process has failed, the reception must be interrupted as it otherwise risks deadlocking if the actual sender has crashed. The crux of the issue now becomes how to restore the capability of using wildcard receptions again. In single threaded programs, multiple simple options would be available. For example, the implementation may report a process failure only once and then automatically re-enable wildcard receptions. This approach however become rife with race conditions in Continue reading Simplifying the ACK/GET_ACKED couple→

SC' 16 paper: Failure Detection and Propagation in HPC System

Posted on February 10, 2017 by George Bosilca

Continue reading SC' 16 paper: Failure Detection and Propagation in HPC System→

"Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems" Presented at SC'15

Posted on November 17, 2015 by herault

Continue reading "Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems" Presented at SC'15→

Logarithmic Revoke Routine

Posted on August 27, 2015 by Aurelien Bouteiller

Starting with ULFM-1.0, the implementation features a logarithmic revoke operation, with a logarithmically bound per-node communication degree. A paper presenting this implementation will be presented at EuroMPI’15. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods. Purpose of the Revoke Operation If the communication pattern of the application is complex, the occurrence of failures has the potential to deeply disturb the application and prevent an effective recovery from being implemented. Consider the example in the above figure: as long as no failure occurs, the processes are communicating in a point-to-point pattern (we decide to call plan A). Process Pk is waiting to receive a message from Pk-1, then sends a message to Pk+1 (when such processes exist). Let’s observe the effect of introducing a failure in plan A, and consider that P1 has failed. As only P2 communicates directly with P1, other processes do not Continue reading Logarithmic Revoke Routine→

Logarithmic Agreement Routine

Posted on August 27, 2015 by herault

Starting with ULFM-1.0, the implementation features a purely logarithmic agreement, with no single point of failure. A paper presenting this implementation will be presented during SC’15. We considered a practical agreement algorithm with the following desired properties: the unique decided value is the result of a combination of all values proposed by deciding processes (a major difference with a 1-set agreement), failures consist of permanent crashes in a pseudo-synchronous system (no data corruption, loss of message, or malicious behaviors are considered), the agreement favors the failure-free performance over the failure case, striving to exchange a logarithmic number of messages in the absence of failures. To satisfy this last requirement, we introduced a practical, intermediate property, called Early Returning: that is the capacity of an early deciding algorithm to return before the stopping condition (early or not) is guaranteed: as soon as a process can determine that the decision value is fixed (except if it fails itself), the process is allowed to return. However, because the process is allowed to return early, later failures may compel that process to participate in additional communications. Therefore, the decision must remain available after the processes return, in order to serve unexpected message exchanges until the stopping condition can be established. Unlike a regular early stopping algorithm, not all processes decide and Continue reading Logarithmic Agreement Routine→

ULFM Flyer

Posted on November 8, 2012 by George Bosilca — No Comments ↓

We have a flyer for User Level Failure Mitigation to show new results and design. It can be found at this link: SC12 ULFM Flyer.