SC’17 Tutorial

If you are attending the tutorial, please download the material used during the sessions from the following links:

  1. Theoretical Session slides
  2. Practical Session slides
  3. Examples
  4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year’s tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience in fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers existing resilience approaches and their modeling, including checkpoint/restart (C/R), buddy checkpointing and Algorithm-Based Fault Tolerance (ABFT). The practical session gives a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, both during and outside the tutorial, we have created a ULFM docker.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Denver, CO !!!

ULFM 2.0rc Docker package

There are many ways to install ULFM. For large-scale experiments or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small test that is not performance-critical, you may prefer to spend your time on the concepts rather than on the installation. We therefore provide a Docker image for those who want to quickly test its capabilities.

Using the Docker Image

  1. Install Docker
    • Docker can be seen as a “lightweight” virtual machine.
    • Docker is available for a wide range of systems (MacOS, Windows, Linux).
    • You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get install docker.io, port install docker-io, etc.)
  2. In a terminal, verify that the Docker installation works.
  3. Load the pre-compiled ULFM Docker machine into your Docker installation
  4. Source the Docker aliases in a terminal. This redirects the “make” and “mpirun” commands in the local shell so that they execute in the Docker machine.
  5. Run some examples to see how this works. Quick examples can be found in the tutorial examples directory. You can now type make to compile the examples using the Docker-provided “mpicc”, and you can execute the generated examples in the Docker machine using the redirected “mpirun”.
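
Steps 2 and 3 above can be sketched as two small shell functions. This is only an illustration under stated assumptions: hello-world is Docker's standard test image, while the archive name in IMAGE_TAR is a placeholder and not the actual file name of the ULFM package. Setting DOCKER=echo prints the docker arguments instead of running them, which is a convenient way to review the commands first.

```shell
# IMAGE_TAR is a placeholder name (assumption); use the archive that is
# shipped with the ULFM Docker package.
IMAGE_TAR="${IMAGE_TAR:-ulfm-image.tar}"
# Set DOCKER=echo to print the docker arguments instead of running them.
DOCKER="${DOCKER:-docker}"

# Step 2: run Docker's standard hello-world test image.
verify_docker() {
    "$DOCKER" run --rm hello-world
}

# Step 3: import the saved image into the local Docker installation.
load_ulfm_image() {
    "$DOCKER" load -i "$IMAGE_TAR"
}
```

After sourcing these definitions, calling verify_docker followed by load_ulfm_image performs the two steps in order.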

Have fun!

Try the Docker packaged ULFM fault tolerant MPI

To support the SC’16 Tutorial, we have designed a self-contained Docker image. This packaged Docker image contains everything you need to compile and run the tutorial examples in a contained sandbox. Docker can be seen as a lightweight virtual machine, running its own copy of an operating system, but without the heavy requirement of a full-blown hypervisor. We use this technology to package a very small Linux distribution containing gcc, mpicc, and mpirun, which lets you compile and run your fault-tolerant MPI examples on your host Linux, Mac or Windows desktop without the effort of compiling a production version of ULFM Open MPI on your own.

Content:

1. A Docker Image with a precompiled version of ULFM Open MPI 1.1.
2. The tutorial hands-on example.
3. Various tests and benchmarks for resilient operations.
4. The sources for the ULFM Open MPI branch release 1.1.

Using the Docker Image

1. Install Docker
You can install Docker quickly, either by downloading one of the official builds from http://docker.io for MacOS and Windows, or by installing Docker from your Linux or MacOS package manager (e.g. yum install docker, apt-get install docker.io, brew/port install docker-io). Please refer to the Docker installation instructions for your system.
2. In a terminal, verify that the Docker installation works.

3. Unpack the package.

4. Load the pre-compiled ULFM Docker machine into your Docker installation.

5. Source the Docker aliases, which redirect the “make” and “mpirun” commands in this terminal’s local shell to the versions provided in the Docker machine.

6. Go to the tutorial examples directory. You can now type make to compile the examples using the Docker-provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun -am ft-enable-mpi -np 10 example. Note the special -am ft-enable-mpi parameter; if this parameter is omitted, the non-fault-tolerant version of Open MPI is launched and applications that experience failures will automatically abort.
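
To illustrate how sourced definitions can redirect “make” and “mpirun” into a container, here is a minimal sketch. It is not the alias file shipped with the package: the image name in ULFM_IMAGE and the /sandbox mount point are assumptions chosen for this example. Setting DOCKER=echo prints the docker arguments instead of running them.

```shell
# ULFM_IMAGE is a hypothetical image name (assumption); use the name that
# "docker images" reports after the ULFM image has been loaded.
ULFM_IMAGE="${ULFM_IMAGE:-ulfm-tutorial}"
# Set DOCKER=echo to print the docker arguments instead of running them.
DOCKER="${DOCKER:-docker}"

# Run a command inside the container, with the current host directory
# mounted at /sandbox (an arbitrary mount point chosen for this sketch).
in_ulfm_docker() {
    "$DOCKER" run --rm -v "$PWD:/sandbox" -w /sandbox "$ULFM_IMAGE" "$@"
}

# Shadow the host commands so that they execute inside the container.
make()   { in_ulfm_docker make "$@"; }
mpirun() { in_ulfm_docker mpirun "$@"; }
```

With these definitions sourced, typing mpirun -am ft-enable-mpi -np 10 example in the examples directory runs the container's mpirun on the files of the current directory.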

Have fun!

SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year’s tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures, to advanced users with prior experience in fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently.

The tutorial was divided into two parts, one theoretical (covering the different existing approaches and their modeling), and one practical. The slides for the two parts are available (theory and practice), as well as the hands-on examples. Unlike in previous years, we have embraced new technologies to facilitate public interaction with ULFM: enjoy the ULFM docker.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Salt Lake City, UT !!!

SC’15 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year’s tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures, to advanced users with prior experience in fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently.

Get the slides (part 1 and part 2) and the examples.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Austin, TX !!!

Tutorial @ SC’14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE

Reliability is one of the major concerns when envisioning the future Exascale platforms. The IESP projects an increase in node performance and node concurrency by one or two orders of magnitude, which translates, even under the most optimistic perspectives, into a mechanical decrease of the mean time to interruption (MTTI) of at least one order of magnitude. Because of this tendency, platform providers, software implementors, and high-performance application users who target capability runs on such machines cannot regard an interruption due to a failure as a rare dramatic event, but must consider faults inevitable and take them into account by integrating some form of fault tolerance.
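
The scaling argument can be made concrete under a common simplifying assumption that is not stated in the text above: the platform consists of N independent components, each failing according to an exponential distribution with mean μ_ind. The first failure among them then arrives at rate N/μ_ind, so the platform MTTI is:

```latex
\mathrm{MTTI}_{\text{platform}} = \frac{\mu_{\text{ind}}}{N}
```

Multiplying the number of components by 10 therefore divides the MTTI by 10. As a purely hypothetical illustration, μ_ind = 10 years and N = 100,000 give an MTTI of about 53 minutes.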

One easy way to get ready is to join us at SC’14 in New Orleans for a tutorial on fault tolerance, a middle ground between theoretical understanding and practical knowledge. This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high performance systems. The main goal is to provide the attendees with a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault-tolerance techniques with User Level Failure Mitigation (a fault-tolerant MPI extension recently proposed to the MPI Forum). Relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and ABFT techniques in a hands-on session.

In preparation for the hands-on session, you need to install ULFM and set up the paths to access it. You can follow either the post or the steps below.
1. Download the version prepared for this tutorial.
2. Untar it in some convenient location.

3. Go inside the newly untarred directory and launch

4. Configure Open MPI. Set --prefix to the installation directory of your choice, and make sure that --enable-mpi-ext=ftmpi --with-ft=mpi is specified.

5. Compile, link and install

6. Add the bin and lib subdirectories of the directory provided as --prefix in Step 4 to your PATH and LD_LIBRARY_PATH, respectively. For example, with bash, one can add the two corresponding export lines to ${HOME}/.bashrc.

7. You’re almost ready for the hands-on session.
8. The archive with the examples and skeletons is available here.
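
Steps 4 to 6 can be sketched in shell as follows. The configure options are the ones named in step 4, and make all install is the usual Open MPI build target; the installation directory in ULFM_PREFIX is an assumption made for this example, so choose your own location.

```shell
# ULFM_PREFIX is a hypothetical install location (assumption).
ULFM_PREFIX="${ULFM_PREFIX:-$HOME/opt/ulfm}"

# Step 4: configure with the fault-tolerance options named in the text.
configure_ulfm() {
    ./configure --prefix="$ULFM_PREFIX" --enable-mpi-ext=ftmpi --with-ft=mpi
}

# Step 5: compile, link and install.
build_ulfm() {
    make all install
}

# Step 6: the two export lines below are what would go into ~/.bashrc.
ulfm_env() {
    export PATH="$ULFM_PREFIX/bin:$PATH"
    export LD_LIBRARY_PATH="$ULFM_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

Run configure_ulfm and build_ulfm from inside the untarred source directory, then call ulfm_env (or copy its two export lines into your shell startup file) so that mpicc and mpirun resolve to the ULFM build.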

Let’s rock the faults!

The slides used during this tutorial are available here, here and here.