When attending the tutorial please download the material used during the tutorial from the following links:
The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitionners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.
The tutorial is divided in two parts, one addressing the theoretical aspect and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions introduces a mode detailed description of the ULFM extensions and a set of handon examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created an ULFM docker.
See you all in Denver, CO !!!