Tutorial @ SC'14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE

            <![CDATA[Reliability is one of the major concerns when envisioning future Exascale platforms. The <a title="IESP" href="http://www.exascale.org" target="_blank" rel="noopener noreferrer">IESP</a> projects an increase in node performance and node concurrency of one or two orders of magnitude, which translates, even under the most optimistic assumptions, into a mechanical decrease of the mean time to interruption (MTTI) of at least one order of magnitude. Because of this trend, platform providers, software implementors, and high-performance application users who target capability runs on such machines can no longer regard an interruption caused by a failure as a rare, dramatic event; they must treat faults as inevitable and account for them by integrating some form of fault tolerance.

One easy way to get ready is to join us at SC’14 in New Orleans for a tutorial on fault tolerance that strikes a middle ground between theoretical understanding and practical knowledge. This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high-performance systems. The main goal is to give attendees a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts: (i) An overview of failure types (software/hardware, transient/fail-stop) and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction, and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault-tolerant techniques with User Level Fault Mitigation (a fault-tolerant MPI extension recently proposed to the MPI Forum). In a hands-on session, relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and ABFT techniques.
To prepare for the hands-on session, install ULFM and set up the paths to access it. You can follow either the post or the steps below.
1. Download the version prepared for this tutorial.
2. Untar it in some convenient location.

tar jxvf ulfm-4419d3f7cee3.tbz
3. Go inside the newly untarred directory and run
./autogen.pl
4. Configure Open MPI. Change the --prefix value in the following command to your install location, and make sure that --enable-mpi-ext=ftmpi --with-ft=mpi is specified.
./configure --prefix=... --enable-debug --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default --enable-mpi-ext=ftmpi --with-ft=mpi
5. Compile, link and install.
make -j 4 install
6. Add the directory provided as --prefix in Step 4 to your PATH and LD_LIBRARY_PATH. For example, with bash, add the following two lines to ${HOME}/.bashrc, where <> stands for that directory.
export PATH=<>/bin:$PATH
export LD_LIBRARY_PATH=<>/lib:$LD_LIBRARY_PATH
7. You’re almost ready for the hands-on session.
8. The archive with the examples and skeletons is available here.
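The PATH and LD_LIBRARY_PATH step above can also be wrapped in a small shell function, which avoids editing the two export lines by hand. This is only a minimal sketch: the function name `ulfm_env` is hypothetical (it is not part of ULFM or Open MPI), and the prefix you pass must be whatever directory you gave to --prefix when configuring.

```shell
# Sketch only: ulfm_env is a hypothetical helper, not shipped with ULFM.
# It prepends <prefix>/bin to PATH and <prefix>/lib to LD_LIBRARY_PATH,
# skipping entries that are already present.
ulfm_env() {
    local prefix="${1:-}"

    # Refuse an empty or missing install prefix.
    if [ -z "$prefix" ] || [ ! -d "$prefix" ]; then
        echo "ulfm_env: '$prefix' is not a directory" >&2
        return 1
    fi

    # Prepend bin/ unless PATH already contains it.
    case ":$PATH:" in
        *":$prefix/bin:"*) ;;
        *) export PATH="$prefix/bin:$PATH" ;;
    esac

    # Prepend lib/; the ${VAR:+...} form avoids a trailing colon
    # when LD_LIBRARY_PATH was empty or unset.
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$prefix/lib:"*) ;;
        *) export LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
}
```

One possible use is to place the function in ${HOME}/.bashrc and call it with your own install directory, for instance `ulfm_env "$HOME/ulfm"` (a made-up location). Afterwards, `command -v mpirun` should report a path under that directory if the install step succeeded.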
    Let’s rock the faults!
    The slides used during this tutorial are available here, here and here.]]>