SC’17 Tutorial

When attending the tutorial please download the material used during the tutorial from the following links:

  1. Theoretical Session slides
  2. Practical Session slides
  3. Examples
  4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitionners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided in two parts, one addressing the theoretical aspect and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions introduces a mode detailed description of the ULFM extensions and a set of handon examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created an ULFM docker.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Denver, CO !!!

SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial was divided in two parts, one theoretical (covering the different existing approaches and their modeling), and one practical. The slides for the 2 parts are available (theory and practice), as well as the handon examples. Unlike the previous years, we have embraced new technologies to facilitate the public interaction with ULFM: enjoy the ULFM docker.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Salt Lake City, UT !!!

SC’15 tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

Get the slides part1, part2, and the examples

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Austin, TX !!!

Tutorial @ SC’14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE

Reliability is one of the major concerns when envisioning the future Exascale platforms. The IESP projects an increase in node performance and node concurrency by one or two orders of magnitude, which translates, even under the most optimistic perspectives, in a mechanical decrease of the mean time to interruption (MTTI) of at least one order of magnitude. Because of this tendency, platform providers, software implementors, and high-performance application users who target capability runs on such machines cannot regard the occurrence of interruption due to a failure as a rare dramatic event, but must consider faults inevitable and take them into account by integrating some form of fault-tolerance.

One easy way to get ready is to join us at SC’14 in New Orleans for a tutorial on fault tolerance, a middle-ground between theoretical understanding and practical knowledge. This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high performance systems. The main goal is to provide the attendees with a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithm or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault tolerant techniques with User Level Fault Mitigation (a fault tolerant MPI extension recently proposed to the MPI forum). Relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and ABFT techniques in a hands-on session.

In preparation for the hand-on session one needs to get ready by installing ULFM and setting up the paths to access it. You can either follow the post or the steps below.
1. Download the version prepared for this tutorial.
2. Untar it in some convenient location.

3. Go inside the newly untarred directory and launch

4. Configure Open MPI. You should change the –prefix in the following command, and make sure that –enable-mpi-ext=ftmpi –with-ft=mpi is specified.

5. Compile, link and install

6. Add the directory provided as a –prefix in Step 4 to your PATH and LD_LIBRARY_PATH. As an example for bash one can add the following two lines to ${HOME}/.bashrc

7. You’re almost ready for the hand-on session.
8. The archive with the examples and skeletons is available here.

Let’s rock the faults!

The slides used during this tutorial are available here, here and here.

Preparing for June MPI Forum meeting

In preparation for the June MPI forum meeting, the specification has received some updates. The most prominent changes are:

  • The exposed memory in an RMA window may be completely undefined after a failure has occured.
  • MPI_Comm_agree now operates a binary AND on the flag argument.
  • Examples have been corrected to use error classes, instead of error codes, when relevant.

The latest version is available in the ULFM specification area