Categories
Tutorial ULFM User Level Failure Mitigation

SC’20 Tutorial

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’20. This year, the tutorial will be split into two sessions over two days. Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts.

  1. Monday the 9th material
    1. slides Introduction and Methods for Fault Tolerance
    2. slides Failures and Silent Errors
    3. slides Checkpointing: the Young/Daly formula
    4. slides In-memory & Hierarchical Checkpointing, Replication
    5. slides Silent Errors
  2. Tuesday the 10th material
    1. slides Handling Errors in MPI Applications
    2. slides Recovering MPI Application
    3. Examples
    4. docker ULFM Fault Tolerant MPI on your desktop

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitioners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction [1, 2, 3, 4, 5] covers different existing resilience approaches, including modeling solutions such as checkpoint/restart (C/R), buddy checkpointing and Algorithm-Based Fault Tolerance (ABFT). The practical sessions [1, 2] give a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM Docker image.

Enjoy our promotional video 😉


See you all online !!!


SC’19 Tutorial

Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts.

  1. Theoretical Session slides
  2. Practical Session slides
  3. Examples
  4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’19. The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitioners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithm-Based Fault Tolerance (ABFT). The practical session gives a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM Docker image.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Denver !!!


SC’18 Tutorial

Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts.

  1. Theoretical Session slides
  2. Practical Session slides
  3. Examples
  4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’18 (somewhat similar to past incarnations). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitioners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithm-Based Fault Tolerance (ABFT). The practical session gives a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM Docker image.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Texas !!!


ULFM 2.1a1 Docker Package

There are many ways to install ULFM. For performance evaluation, large-scale experiments and large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small test that is not performance critical, one might prefer to spend time working on the concepts rather than on the installation. Thus, we provide a Docker image for those who want to quickly test its capabilities. The 2.1a1 Docker image is a bugfix release compared to the previous 2.0rc version.

Using the Docker Image

  1. Install Docker
    • Docker can be seen as a “lightweight” virtual machine.
    • Docker is available for a wide range of systems (MacOS, Windows, Linux).
    • You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get install docker.io, port install docker-io, etc.)
  2. In a terminal, run docker run hello-world to verify that the Docker installation works.
  3. Load the pre-compiled ULFM Docker machine into your Docker installation: docker pull abouteiller/mpi-ft-ulfm
  4. Source the Docker aliases in a terminal; this will redirect the “make” and “mpirun” commands in the local shell to execute in the Docker machine.
    alias make='docker run -v $PWD:/sandbox:Z abouteiller/mpi-ft-ulfm make'
    alias mpirun='docker run -v $PWD:/sandbox:Z abouteiller/mpi-ft-ulfm mpirun --oversubscribe -mca btl tcp,self'
  5. Run some examples to see how this works. Quick examples can be found in the tutorial examples directory. You can now type make to compile the examples using the Docker-provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun -np 10 example
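
As a rough sketch of steps 4 and 5, the two aliases can also be written as shell functions, which is convenient in scripts because bash does not expand aliases in non-interactive shells by default. The image name and options are the ones quoted in step 4; the names ulfm_make, ulfm_mpirun, ULFM_IMAGE and DOCKER are hypothetical and introduced here only for illustration.

```shell
#!/bin/sh
# Sketch only: shell functions standing in for the aliases from step 4.
# ULFM_IMAGE and DOCKER are hypothetical variables added for illustration;
# the image name and the mpirun options are copied from the aliases.
ULFM_IMAGE="${ULFM_IMAGE:-abouteiller/mpi-ft-ulfm}"
DOCKER="${DOCKER:-docker}"

# Run make inside the container, with the current directory mounted on /sandbox.
ulfm_make() {
    "$DOCKER" run -v "$PWD:/sandbox:Z" "$ULFM_IMAGE" make "$@"
}

# Run mpirun inside the container with the same options as the alias.
ulfm_mpirun() {
    "$DOCKER" run -v "$PWD:/sandbox:Z" "$ULFM_IMAGE" \
        mpirun --oversubscribe -mca btl tcp,self "$@"
}
```

After sourcing this file from the examples directory, ulfm_make would compile the examples and ulfm_mpirun -np 10 example would launch one of them, mirroring step 5. Setting DOCKER=echo prints the command line instead of running it, which is a cheap way to check what would be executed.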

Have fun!


EuroMPI’18 tutorial

Following the success of the first joint tutorial with the VeloC team, we decided to follow up with a second incarnation of this mixed tutorial at EuroMPI’18. Bogdan Nicolae, Franck Cappello and George Bosilca will present this tutorial, titled Resilience in parallel applications. The tutorial will present two complementary fault management techniques that empower application developers to deal with various types of failures directly at the application level, increasing the opportunities to reduce the resilience overhead with holistic support from all layers: hardware and software, as well as the parallel programming paradigm. The tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches:

  • application-defined checkpoint-restart (as demonstrated through the VeloC runtime); and
  • user-level failure mitigation (as demonstrated through the ULFM extension to the MPI standard).

The tutorial will use the following decks of slides: Introduction, VeloC and ULFM, as well as a set of examples for VeloC and ULFM. For the hands-on session, participants are expected to bring their own laptop, running Windows, Linux or Mac OS X, with Docker installed.

Using the Docker Image

  1. Install Docker
    • Docker can be seen as a “lightweight” virtual machine, a perfect way to quickly set up a tutorial execution environment. You will need basic knowledge of Docker, which is available either from the documentation or from a cheat sheet.
    • Docker is available for a wide range of systems (MacOS, Windows, Linux).
    • You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get install docker.io, port install docker-io, etc.)
  2. Validate your Docker installation by running in a terminal
    docker run hello-world
  3. Load the pre-compiled ULFM Docker machine into your Docker installation
    docker pull bnicolae/veloc-tutorial

    which contains all libraries (ULFM and VeloC) needed to complete the tutorial

  4. Source the Docker aliases in a terminal using source dockervars.sh or, on Windows, call dockervars.bat (both shell files are in the example tarball). These aliases will redirect the “make”, “mpicc”, “mpif90”, “mpiexec” and “mpirun” commands to execute in the Docker machine instead of in the local environment (pretty nifty). Beware: the aliases must be loaded in each new shell where you want to play with the Docker machine.
  5. Get the tutorial hands-on archive and untar it (on Linux and Mac OS X: tar -zxvf eurompi18-handson.tgz), then go to the tutorial hands-on directory (cd eurompi18). Before going further, make sure the Docker aliases are correctly loaded (alias), or you will be able neither to compile nor to run the examples. You can now type make to compile the examples, and you can execute the generated examples in the Docker machine using mpirun -np 4 *example*.
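
The check in step 5 can be automated. The following is a minimal sketch, assuming that dockervars.sh defines aliases under exactly the five command names listed in step 4; the function name check_docker_aliases is hypothetical and not part of the tutorial material.

```shell
# Sketch: report which of the commands named in step 4 are currently aliased.
# Returns 0 when all five aliases are defined, 1 otherwise.
# Note: the variables "missing" and "cmd" are global (plain sh has no "local").
check_docker_aliases() {
    missing=0
    for cmd in make mpicc mpif90 mpiexec mpirun; do
        # "alias NAME" succeeds only if NAME is defined as an alias.
        if alias "$cmd" >/dev/null 2>&1; then
            echo "ok: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

The function has to be defined and called in the same shell in which the aliases were sourced, since aliases are not inherited by child processes. Any "missing" line suggests that the aliases need to be loaded again in that shell.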

SC’17 Tutorial

When attending the tutorial please download the material used during the tutorial from the following links:

  1. Theoretical Session slides
  2. Practical Session slides
  3. Examples
  4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year's tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitioners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithm-Based Fault Tolerance (ABFT). The practical session gives a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM Docker image.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Denver, CO !!!


ULFM 2.0rc Docker package

There are many ways to install ULFM. For large-scale experiments or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small test that is not performance critical, one might prefer to spend time working on the concepts rather than on the installation. Thus, we provide a Docker image for those who want to quickly test its capabilities.

Using the Docker Image

  1. Install Docker
    • Docker can be seen as a “lightweight” virtual machine.
    • Docker is available for a wide range of systems (MacOS, Windows, Linux).
    • You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get install docker.io, port install docker-io, etc.)
  2. In a terminal, run docker run hello-world to verify that the Docker installation works.
  3. Load the pre-compiled ULFM Docker machine into your Docker installation
  4. Source the Docker aliases in a terminal; this will redirect the “make” and “mpirun” commands in the local shell to execute in the Docker machine.
  5. Run some examples to see how this works. Quick examples can be found in the tutorial examples directory. You can now type make to compile the examples using the Docker-provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun.

Have fun!


Try the Docker packaged ULFM fault tolerant MPI

To support the SC’16 Tutorial, we have designed a self-contained Docker image. This packaged Docker image contains everything you need to compile and run the tutorial examples in a contained sandbox. Docker can be seen as a lightweight virtual machine, running its own copy of an operating system, but without the heavy requirement of a full-blown hypervisor. We use this technology to package a very small Linux distribution containing gcc, mpicc, and mpirun, as needed to compile and run your fault-tolerant MPI examples on your Linux, Mac or Windows desktop, without the effort of compiling a production version of ULFM Open MPI on your own.

Content:

1. A Docker Image with a precompiled version of ULFM Open MPI 1.1.
2. The tutorial hands-on example.
3. Various tests and benchmarks for resilient operations.
4. The sources for the ULFM Open MPI branch release 1.1.

Using the Docker Image

1. Install Docker
You can install Docker quickly, either by downloading one of the official builds from http://docker.io for MacOS and Windows, or by installing Docker from your Linux or MacOS package manager (e.g. yum install docker, apt-get install docker.io, brew/port install docker-io). Please refer to the Docker installation instructions for your system.
2. In a terminal, verify that the Docker installation works by running docker run hello-world

3. Unpack the package:

4. Load the pre-compiled ULFM Docker machine into your Docker installation:

5. Source the Docker aliases, which will redirect the “make” and “mpirun” commands in this terminal’s local shell to execute the provided commands from the Docker machine.

6. Go to the tutorial examples directory. You can now type make to compile the examples using the Docker-provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun -am ft-enable-mpi -np 10 example. Note the special -am ft-enable-mpi parameter; if this parameter is omitted, the non-fault-tolerant version of Open MPI is launched and applications containing failures will automatically abort.

Have fun!


SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year's tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided into two parts, one theoretical (covering the different existing approaches and their modeling), and one practical. The slides for the two parts are available (theory and practice), as well as the hands-on examples. Unlike in previous years, we have embraced new technologies to facilitate public interaction with ULFM: enjoy the ULFM Docker image.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Salt Lake City, UT !!!


SC’15 tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year's tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

Get the slides part1, part2, and the examples

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Austin, TX !!!