The User Level Failure Mitigation (ULFM) proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crash failures (such as node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure; it must either succeed or raise an MPI error. In addition, the design is centered around user needs and flexibility: the API should allow varied fault-tolerance models to be built as external libraries.
To use a ULFM implementation, an MPI application must change the default error handler on (at least) MPI_COMM_WORLD from MPI_ERRORS_ARE_FATAL to either MPI_ERRORS_RETURN or a custom MPI error handler. A ULFM implementation provides the three supplementary error codes and five supplementary interfaces listed below, which are defined in the communicator section of the ULFM chapter of the standard draft document.
- MPIX_ERR_PROC_FAILED when a process failure prevents the completion of an MPI operation.
- MPIX_ERR_PROC_FAILED_PENDING when a potential sender matching a non-blocking wildcard source receive has failed.
- MPIX_ERR_REVOKED when one of the ranks in the application has invoked the MPIX_Comm_revoke operation on the communicator.
- MPIX_Comm_revoke(MPI_Comm comm) interrupts any communication pending on the communicator at all ranks.
- MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm* newcomm) creates a new communicator from which the failed processes in comm have been removed.
- MPIX_Comm_agree(MPI_Comm comm, int *flag) performs a consensus (i.e. a fault-tolerant allreduce operation) on flag, using the bitwise AND operation.
- MPIX_Comm_failure_get_acked(MPI_Comm, MPI_Group*) obtains the group of currently acknowledged failed processes.
- MPIX_Comm_failure_ack(MPI_Comm) acknowledges that the application intends to ignore the effect of currently known failures on wildcard receive completions and agreement return values.
A rich community of users and researchers has grown around these new capabilities. The goal of this website is to serve as a hub for this community to access and share resources about designing fault-tolerant applications.
To contact the ULFM developers, use the mailing list we have established with Google Groups. To post to the list, you first need to subscribe by sending an email to firstname.lastname@example.org. You can then send emails to email@example.com.
If you are looking for a general reference for ULFM, or want to cite one, please use:
Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra: Post-failure recovery of MPI communication capability: Design and rationale. IJHPCA 27(3): 244-254 (2013).
Available from: http://journals.sagepub.com/doi/10.1177/1094342013488238.