The User Level Failure Mitigation (ULFM) proposal was developed by the MPI Forum’s Fault Tolerance Working Group. The mechanism that users employ to manage failures was designed around three concepts: 1) simplicity: the API should be easy to understand and use in the most common scenarios; 2) flexibility: the API should allow varied fault-tolerance models to be built as external libraries; and 3) absence of deadlock: no MPI call (point-to-point or collective) may block indefinitely after a failure; each call must either succeed or raise an MPI error.
To use this ULFM implementation, an MPI application must change the default error handler on (at least) MPI_COMM_WORLD from MPI_ERRORS_ARE_FATAL to either MPI_ERRORS_RETURN or a custom MPI Errorhandler.
Note that this implementation is provided to test correctness, not performance. Performance and scalability will be addressed once it has been merged into existing MPI implementations.
To contact the ULFM developers, use the mailing list we have established on Google Groups. You must subscribe before posting: first send an email to email@example.com, after which you can send messages to firstname.lastname@example.org.
Thanks to Josh Hursey for originally writing much of this guide.