Tutorial ULFM User Level Failure Mitigation

SC’20 Tutorial

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’20. This year, the tutorial will be split into two sessions over two days. Before attending, please download the following material; you will need it to follow both the theoretical and the practical parts.

  1. Monday the 9th material
    1. slides Introduction and Methods for Fault Tolerance
    2. slides Failures and Silent Errors
    3. slides Checkpointing: the Young/Daly formula
    4. slides In-memory & Hierarchical Checkpointing, Replication
    5. slides Silent Errors
  2. Tuesday the 10th material
    1. slides Handling Errors in MPI Applications
    2. slides Recovering MPI Application
    3. Examples
    4. docker ULFM Fault Tolerant MPI on your desktop

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges posed by different types of failures and their potential impact on applications, to practitioners with prior experience in fault-related topics who want a more precise understanding of the tools available for dealing with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction [1, 2, 3, 4, 5] covers different existing resilience approaches, including modeling solutions such as checkpoint/restart (C/R), buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions [1, 2] give a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, both during and outside the tutorial, we have created a ULFM Docker image.

Enjoy our promotional video 😉

See you all online!