The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’23. This year, the tutorial will be held as a single, full-day session on Sunday, November 12th. Before attending, please download the following material; you will need it to follow both the theoretical and the practical parts.
- Introduction and Methods for Fault Tolerance slides A
- Performance Models slides B
- Handling Errors in MPI Applications slides C
- Recovering MPI Application slides D
- Examples sc23-examples
- docker ULFM Fault Tolerant MPI on your desktop
The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges posed by different types of failures and their potential impact on applications, to practitioners with prior experience in fault-related topics who want a more precise understanding of the tools available for dealing with faults efficiently.
The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction (items 1 and 2 above) covers different existing resilience approaches, including modeling solutions such as checkpoint/restart (C/R), buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions (items 3 and 4 above) give a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, both during and outside the tutorial, we have created a ULFM Docker image.
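To give a flavor of the kind of C/R modeling mentioned above, here is a minimal sketch of the classic first-order Young/Daly estimate for the checkpoint interval. It is not taken from the tutorial slides: the function names and the numbers in the usage lines are illustrative assumptions, and the model ignores downtime and recovery cost.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly estimate of the time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed known).
    mtbf_s: platform mean time between failures, in seconds (assumed known).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpointing and re-execution.

    The first term is the checkpoint overhead per period; the second is the
    expected work lost per failure (half a period) divided by the MTBF.
    Downtime and recovery time are deliberately left out of this sketch.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


# Hypothetical numbers: a 60 s checkpoint on a platform with a one-day MTBF.
period = young_daly_period(60.0, 86400.0)
waste = first_order_waste(period, 60.0, 86400.0)
print(f"checkpoint every {period:.0f} s, estimated waste {waste:.1%}")
```

Under this model, checkpointing more or less often than the estimated period both increase the waste, which is the trade-off the performance-model part of the tutorial examines in more depth.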
Enjoy our promotional video 😉
See you all in Denver!!!