Categories
User Level Failure Mitigation

SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial was divided in two parts, one theoretical (covering the different existing approaches and their modeling), and one practical. The slides for the 2 parts are available (theory and practice), as well as the handon examples. Unlike the previous years, we have embraced new technologies to facilitate the public interaction with ULFM: enjoy the ULFM docker.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Salt Lake City, UT !!!

Categories
User Level Failure Mitigation

SC’15 tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

Get the slides part1, part2, and the examples

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Austin, TX !!!

Categories
User Level Failure Mitigation

Tutorial @ SC'14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE

            <![CDATA[Reliability is one of the major concerns when envisioning the future Exascale platforms. The <a title="IESP" href="http://www.exascale.org" target="_blank" rel="noopener noreferrer">IESP</a> projects an increase in node performance and node concurrency by one or two orders of magnitude, which translates, even under the most optimistic perspectives, in a mechanical decrease of the mean time to interruption (MTTI) of at least one order of magnitude. Because of this tendency, platform providers, software implementors, and high-performance application users who target capability runs on such machines cannot regard the occurrence of interruption due to a failure as a rare dramatic event, but must consider faults inevitable and take them into account by integrating some form of fault-tolerance.

One easy way to get ready is to join us at SC’14 in New Orleans for a tutorial on fault tolerance, a middle-ground between theoretical understanding and practical knowledge. This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high performance systems. The main goal is to provide the attendees with a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithm or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault tolerant techniques with User Level Fault Mitigation (a fault tolerant MPI extension recently proposed to the MPI forum). Relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and ABFT techniques in a hands-on session.
In preparation for the hand-on session one needs to get ready by installing ULFM and setting up the paths to access it. You can either follow the post or the steps below.
1. Download the version prepared for this tutorial.
2. Untar it in some convenient location.

tar jxvf ulfm-4419d3f7cee3.tbz
  1. Go inside the newly untarred directory and launch
./autogen.pl
  1. Configure Open MPI. You should change the –prefix in the following command, and make sure that –enable-mpi-ext=ftmpi –with-ft=mpi is specified.
./configure --prefix=... --enable-debug --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default --enable-mpi-ext=ftmpi --with-ft=mpi
  1. Compile, link and install
make -j 4 install
  1. Add the directory provided as a –prefix in Step 4 to your PATH and LD_LIBRARY_PATH. As an example for bash one can add the following two lines to ${HOME}/.bashrc
export PATH=<>/bin:$PATH
export LD_LIBRARY_PATH=<>/lib:$LD_LIBRARY_PATH.
  1. You’re almost ready for the hand-on session.
  2. The archive with the examples and skeletons is available here.
    Let’s rock the faults!
    The slides used during this tutorial are available here, here and here.]]>