## SC’17 Tutorial

1. Theoretical Session slides
2. Practical Session slides
3. Examples
4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitionners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided in two parts, one addressing the theoretical aspect and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions introduces a mode detailed description of the ULFM extensions and a set of handon examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created an ULFM docker.

See you all in Denver, CO !!!

## ULFM 2.0rc Docker package

There are many ways to install ULFM. For large scale experiments or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small non-performance critical test, one might want to spend time on working on the concepts instead of installing. Thus, we provide a docker image for those who want to quickly test it’s capabilities.

### Using the Docker Image

1. Install Docker
• Docker can be seen as a “lightweight” virtual machine.
• Docker is available for a wide range of systems (MacOS, Windows, Linux).
• You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get docker-io, port install docker-io, etc.)
2. In a terminal, Run

to verify that the docker installation works.
4. Source the docker aliases in a terminal, this will redirect the “make”
and “mpirun” command in the local shell to execute in the Docker machine.
5. Run some example to see how this works. Quick examples can be found in the tutorial examples directory. You can now type make to compile the examples using the Docker provided “mpicc”, and you can execute the generated examples in the Docker machine using

Have fun!

## ULFM 2.0

ULFM 2.0 release

The ICL ULFM team is happy to announce ULFM 2.0 a new implementation of the MPI extension handling faults in sync with the current Open MPI master. Innumerable new features have been added both to Open MPI and to ULFM, we will focus on this announce on the ULFM ones. For information about what is new in Open MPI please read www.open-mpi.

# Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crash (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error. This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM chapter standard draft document.

• MPIX_ERR_PROC_FAILED when a process failure prevents the completion of an MPI operation.
• MPIX_ERR_PROC_FAILED_PENDING when a potential sender matching a non-blocking wildcard source receive has failed.
• MPIX_ERR_REVOKED when one of the ranks in the application has invoked the MPI_Comm_revoke operation on the communicator.
• MPIX_Comm_revoke(MPI_Comm comm) Interrupts any communication pending on the communicator at all ranks.
• MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm* newcomm) creates a new communicator where dead processes in comm were removed.
• MPIX_Comm_agree(MPI_Comm comm, int *flag) performs a consensus (i.e. fault tolerant allreduce operation) on flag (with the operation bitwise or).
• MPIX_Comm_failure_get_acked(MPI_Comm, MPI_Group*) obtains the group of currently acknowledged failed processes.
• MPIX_Comm_failure_ack(MPI_Comm) acknowledges that the application intends to ignore the effect of currently known failures on wildcard receive completions and agreement return values.

## Supported Systems

There are four main MPI network models available in Open MPI: “ob1”, “cm”,”yalla”, and “ucx”. Only “ob1” is adapted to support fault tolerance. “ob1” uses BTL (“Byte Transfer Layer”) components for each supported network. “ob1” supports a variety of networks that can be used in combination with each other. Collective operations (blocking and non-blocking) use an optimized implementation on top of “ob1”.

• Loopback (send-to-self)
• TCP
• OpenFabrics: InfiniBand, iWARP, and RoCE
• uGNI (Cray Gemini, Aries)
• Shared memory Vader (FT supported w/CMA, XPmem, KNEM untested)
• Tuned, and non-blocking collective communications

A full list of supported, untested and disabled components is provided below.

More information (tutorials, examples, build instructions for leading Top 500 systems) is also available in the Fault Tolerance ResearchHub website:https://fault-tolerance.org

## Bibliographic References

If you are looking for, or want to cite a general reference for ULFM, please use

Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra: Post-failure recovery of MPI communication capability: Design and rationale. IJHPCA 27(3): 244-254 (2013). Available from: http://journals.sagepub.com/doi/10.1177/1094342013488238.

# Building ULFM Open MPI

There are many available configure options (see ./configure --help for a full list); a summary of the more commonly used ones is included in the upstream Open MPI README file. The following paragraph gives a summary of ULFM Open MPI specific options behavior.

## Configure options

• --with-ft=TYPE Specify the type of fault tolerance to enable. Options: mpi (ULFM MPI draft standard). Fault tolerance support is enabled by default(as if --with-ft=mpi were implicitly present on the configure line).You may specify --without-ft to compile an almost stock Open MPI.
• --with-platform=FILE Load configure options for the build from FILE. When --with-ft=mpi is set, the file contrib/platform/ft_mpi_ulfm is loaded by default. This file disables components that are known to not be able to sustain failures, or are insufficiently tested. You may edit this file and/or force back these options on the command line to enable these components.
• --enable-mca-no-build=LIST Comma-separated list of pairs that will not be built. For example, --enable-mca-no-build=btl-portals,oob-ud will disable building the portals BTL and the ud OOB component. When --with-ft=mpi is set, this list is populated with the content of the aforementioned platform file. You may override the default list with this parameter.
• --with-pmi and --with-slurm Force the building of SLURM scheduler support. Slurm with fault tolerance is tested. Do not use srun, otherwise your application gets killed by the RM upon the first failure. Instead, use mpirun in an salloc/sbatch.
• --with-sge This is untested with fault tolerance.
• --with-tm Force the building of PBS/Torque scheduler support. PBS is tested with fault tolerance. Use mpirun in a qsub allocation.
• --disable-oshmem Disable building the OpenSHMEM implementation (by default, it is enabled). ULFM Fault Tolerance does not apply to OpenSHMEM.

## Modified, Untested and Disabled Components

Frameworks and components which are not listed in the following list are unmodified and support fault tolerance. Listed frameworks may be modified(and work after a failure), untested (and work before a failure, but may malfunction after a failure), or disabled (they prevent the fault tolerant components from operating properly).

• pml MPI point-to-point management layer
• “ob1” modified to handle errors
• “monitoring”, “v” unmodified, untested
• “bfo”, “cm”, “crcpw”, “ucx”, “yalla” disabled
• btl Point-to-point Byte Transfer Layer
• “openib”, “tcp”, “vader(+cma)” modified to handle errors (removed unconditional abort on error, expect performance similar to upstream)
• “portals4”, “scif”, “smcuda”, “usnic”, “vader(+knem,+xpmem)” unmodified, untested (may work properly, please report)
• mtl Matching Transport Layer used for MPI point-to-point messages on some types of networks
• All “mtl” components are disabled
• coll MPI collective algorithms
• “tuned”, “basic”, “nbc” modified to handle errors
• “cuda”, “fca”, “hcoll”, “portals4” unmodified, untested (expect unspecified post-failure behavior)
• osc MPI one-sided communications
• Unmodified, untested (expect unspecified post-failure behavior)
• io MPI I/O and dependent components
• fs File system functions for MPI I/O
• fbtl File byte transfer layer: abstraction for individual read/write operations for OMPIO
• fcoll Collective read and write operations for MPI I/O
• sharedfp Shared file pointer operations for MPI I/O
• All components in these frameworks are unmodified, untested (expect clean post-failure abort)
• vprotocol Checkpoint/Restart components
• “pml-v”, “crcp” unmodified, untested
• modified to handle errors (added a global interrupt to trigger all wait_sync objects)

# Running ULFM Open MPI

As ULFM is an extension to the MPI standard, you will need to #include "mpi-ext.h" in C, or use mpi_ext in Fortran to access the supplementary error codes and functions. Compile your application as usual, using the provided mpicc, mpif90, or mpicxx wrappers.

You can launch your application with fault tolerance by simply using the provided mpiexec. Beware that your distribution may already provide a version of MPI, make sure to set your PATH and LD_LIBRARY_PATH properly. Note that fault tolerance is enabled by default in ULFM Open MPI; you can disable all fault tolerance capabilities by launching your application with mpiexec --disable-recovery.

## Running under a batch scheduler

ULFM can operate under a job/batch scheduler, and is tested routinely with both PBS and Slurm. One difficulty comes from the fact that many job schedulers will “cleanup” the application as soon as a process fails. In order to avoid this problem, it is preferred that you use mpiexec within an allocation (e.g. salloc, sbatch, qsub) rather than a direct launch (e.g. srun).

## Run-time tuning knobs

ULFM comes with a variety of knobs for controlling how it runs. The default parameters are sane and should result in good performance in most cases. You can change the default settings with --mca mpi_ft_foo.

• orte_enable_recovery &lt;true|false&gt; (default: true) controls automatic cleanup of apps with failed processes within mpirun. The default differs from upstream Open MPI.
• mpi_ft_enable &lt;true|false&gt; (default: true) permits turning off fault tolerance without recompiling. Failure detection is disabled. Interfaces defined by the fault tolerance extensions are substituted with corresponding non-fault tolerant implementations (e.g. MPI_Allreduce is substituted to MPIX_Comm_agree).
• mpi_ft_verbose  (default: 0) increases the output of the fault tolerance activities. A value of 1 will report detected failures.
• mpi_ft_detector  &lt;true|false&gt; (default: false) controls the activation of the OMPI level failure detector. When this detector if turned off, all failure detection is delegated to ORTE, which may be slow and/or incomplete.
• mpi_ft_detector_thread &lt;true|false&gt; (default: false) controls the use of a thread to emit and receive failure detector’s heartbeats. Setting this value to “true” will also set MPI_THREAD_MULTIPLE support, which has a noticeable effect on latency. You may want to enable this option if you experience false positive processes incorrectly reported as failed.
• mpi_ft_detector_period  (default: 1e-1 seconds) heartbeat period. Recommended value is 1/3 of the timeout. Values lower than 100us may impart a noticeable effect on latency (typically a 3us increase).
• mpi_ft_detector_timeout  (default: 3e-1 seconds) heartbeat timeout (i.e. failure detection speed). Recommended value is 3 times the heartbeat period.

## Known Limitations in ULFM-2.0-rc (2e75c73):

• TOPO, FILE, RMA have little support for fault tolerance mechanisms.
• There is a tradeoff between failure detection accuracy and performance. The current default is to favor performance. Users that experience detection accuracy issues may enable a more precise mode.
• The failure detector operates on MPI_COMM_WORLD exclusively. Processes connected from MPI_COMM_CONNECT/ACCEPT and MPI_COMM_SPAWN may occasionally not be detected when they fail.
• Failures during NBC collective may not be recovered properly in some cases.
• Return of OpenIB credits spent toward a failed process can take several seconds. Until the btl_openib_ib_timeout and btl_openib_ib_retry_count controlled timeout triggers, lack of send credits may cause a temporary stall under certain communication patterns. Impacted users can try to adjust the value of these mca parameters.

# Changelog

## Release 2.0

Focus has been toward integration with current Open MPI master, performance, and stability.

• ULFM is now based upon Open MPI master branch (#689f1be9). It will be regularly updated until eventually merged.
• Fault Tolerance is enabled by default and is controlled with MCA variables.
• Added support for non-blocking collective operations (NBC).
• Added support for advanced failure detection at the MPI level. Implements the algorithm described in “Failure detection and propagation in HPC systems.” https://doi.org/10.1109/SC.2016.26.
• Removed the need for special handling of CID allocation.
• Non-usable components are automatically removed from the build during configure
• RMA, FILES, and TOPO components are enabled by default, and usage in a fault tolerant execution warns that they may cause undefined behavior after a failure.
• Bugfixes:
• Code cleanup and performance cleanup in non-FT builds; –without-ft at configure time gives an almost stock Open MPI.
• Code cleanup and performance cleanup in FT builds with FT runtime disabled; –mca ft_enable_mpi false thoroughly disables FT runtime activities.
• Some error cases would return ERR_PENDING instead of ERR_PROC_FAILED in collective operations.
• Some test could set ERR_PENDING or ERR_PROC_FAILED instead of ERR_PROC_FAILED_PENDING for ANY_SOURCE receptions.

## Release 1.1

Focus has been toward improving stability, feature coverage for intercomms, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

• Forked from Open MPI 1.5.5 devel branch
• Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations.
• Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard blessed -extension- names MPIX_ERR_xxx.
• Support for Intercommunicators:
• Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators.
• MPI_COMM_REVOKE tested on intercommunicators.
• Disabled completely (.ompi_ignore) many untested components.
• Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
• Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
• Bugfixes:
• SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
• SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
• Revoked send operations are now always completed or remote cancelled and may not deadlock anymore.
• Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
• Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

## Release 1.0

Focus has been toward improving performance, both before and after the occurrence of failures. The list of new features includes:

• Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
• Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
• New algorithm to perform agreements, with a truly logarithmic complexity in number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK.
• New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks.
• Improved support for our traditional network layer:
• TCP: fully tested
• SM: fully tested (with the exception of XPMEM, which remains unsupported)
• Added support for High Performance networks
• Open IB: reasonably tested
• uGNI: reasonably tested
• The tuned collective module is now enabled by default (reasonably tested), expect a huge performance boost compared to the former basic default setting
• Back-ported PBS/ALPS fixes from Open MPI
• Back-ported OpenIB bug/performance fixes from Open MPI
• Improve Context ID allocation algorithm to reduce overheads of Shrink
• Miscellaneous bug fixes

## Version Numbers and Binary Compatibility

Starting from ULFM Open MPI version 2.0, ULFM Open MPI is binary compatible with the corresponding Open MPI master branch and compatible releases (see the binary compatibility and version number section in the upstream Open MPI README). That is, applications compiled with a compatible Open MPI can run with the ULFM Open MPI mpirun and MPI libraries. Conversely, as long as the application does not employ one of the MPIX functions, which are exclusively defined in ULFM Open MPI, an application compiled with ULFM Open MPI can be launched with a compatible Open MPI mpirun and run with the non-fault tolerant MPI library.

# Contacting the Authors

Found a bug? Got a question? Want to make a suggestion? Want to contribute to ULFM Open MPI? Working on a cool use-case? Please let us know! The best way to report bugs, send comments, or ask questions is to sign up on the user’s mailing list: ulfm+subscribe@googlegroups.com Because of spam, only subscribers are allowed to post to these lists (ensure that you subscribe with and post from exactly the same e-mail address — joe@example.com is considered different thanjoe@mycomputer.example.com!). Visit these pages to subscribe to the lists: https://groups.google.com/forum/#!forum/ulfm When submitting questions and problems, be sure to include as much information as possible. This web page details all the information that we request in order to provide assistance: http://www.open-mpi.org/community/help/

## Running on Edison

ULFM configuration for NERSC Edison

Edison is a Cray XC30, with a peak performance of 2.57 petaflops/sec, 133,824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk. Hosted at NERSC, it offers access to many researchers and practitioners in the US.

Installing the latest versions of Open MPI (including the development, unstable and stable) on this machine can easily be done by using the provided platform file. The same cannot unfortunately be said about the current ULFM version. This version was based on an older unstable version of Open MPI (1.7) and didnt backported all the new exciting features from the development branch of Open MPI. As a result, getting it to work on Edison is a little challenging, but we all like a little challenge.

### Preparing for installation

This step is unfortunately required, as I couldnt find any other way to update the m4 file of our ancient ULFM installation, without dedicating too much time to such a hopeless task (especially as the new ULFM 2.0 is reaching almost stable status). The issue is triggered by a misconfiguration of autoconf 2.69 on Edison, which breaks all the autotool chain. The issue can be easily corrected by downloading the Mercurial version of ULFM on another computer, running autogen.sh, a quick configure with no arguments, followed by “make dist”. This will generate 2 archives (openmpi-1.7.1ULFM.tar.bz2 and openmpi-1.7.1ULFM.tar.gz). Use the one corresponding to your preferred compression algorithm, and move it on Edison.

### Compiling from source

With the archive moved on Edison, you can now undergo the last step to configure and compile ULFM there. Lets assume we want to build an optimized version, integrated with Edison resource management software (Slurm), using uGNI, without debugging information. One of the interesting features ULFM inherited from Open MPI is the capability of using platform files. Here is the platform file for Edison, it will eventually make its way into the official ULFM release, meanwhile you can copy and paste directly from here.

### Running ULFM applications in an interactive job

Congratulations, the hardest steps are now done and you are now supposed to have a fully compiled and ready to run version of ULFM, and its binaries in your \$PATH. Allocating an interactive job on Edison is well described in the online documentation. There is however a trick, you need to add an option to prevent Slurm from destroying your job when one of the processes disappear. This option is no-kill.

Thus, allocating an interactive job with 4 processes for a short duration can be achieved with salloc -p debug -N 4 --no-kill.
Running an application then become easy, as indicated in ULFM setup steps.

You code has now a sane execution environment that is able to survive faults, and can help your application achieve the same goal.

## Try the Docker packaged ULFM fault tolerant MPI

To support the SC’16 Tutorial, we have designed a self contained Docker image. This packaged docker image contains everything you need to compile, and run the tutorial examples, in a contained sandbox. Docker can be seen as a lightweight virtual machine, running its own copy of an operating system, but without the heavy requirement of a full-blown hypervisor. We use this technology to package a very small Linux distribution containing gcc, mpicc, and mpirun, as needed to compile and run natively your fault tolerant MPI examples on your host Linux, Mac or Windows desktop, without the effort of compiling a production version of ULFM Open MPI on your own.

##### Content:

1. A Docker Image with a precompiled version of ULFM Open MPI 1.1.
2. The tutorial hands-on example.
3. Various tests and benchmarks for resilient operations.
4. The sources for the ULFM Open MPI branch release 1.1.

##### Using the Docker Image

1. Install Docker
You can install Docker quickly, either by downloading one of the official builds from http://docker.io for MacOS and Windows, or by installing Docker from your Linux or MAcOS package manager (i.e. yum install docker, apt-get docker-io, brew/port install docker-io). Please refer to the Docker installation instructions for your system.
2. In a terminal, verify that the docker installation works by running

3. Unpack the package:

4. Source the docker aliases, which will redirect the “make” and “mpirun” command in this terminal’s local shell to execute the provided commands from the Docker machine.

5. Go to the tutorial examples directory. You can now type make to compile the examples using the Docker provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun -am ft-enable-mpi -np 10 example. Note the special -am ft-enable-mpi` parameter; if this parameter is omitted, the non-fault tolerant version of Open MPI is launched and applications containing failures will automatically abort.

Have fun!

## ULFM Specification update

A new version of the ULFM specification accounting for remarks and discussions going on at the MPI Forum Meetings in 2016 has been posted under the ULFM Specification item.

This new update has very few semantic changes. It clarifies the failure behavior of MPI_COMM_SPAWN, and corrects the output values of error codes and status objects returned from functions completing in error.

## SC’ 16 paper: Failure Detection and Propagation in HPC System

We have presented our paper on the scalable failure detection, “Failure Detection and Propagation in HPC System” at SC’16.

Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds.
The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior.
The propagation stage is using a non-uniform variant of a reliable broadcast over a circulant graph overlay network, and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the ORNL Titan supercomputer, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm.

The slides presented by Yves are here and the paper is also available.

This failure detector is implemented in the ULFM 2.0, an effort that will soon become open to public.

## SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial was divided in two parts, one theoretical (covering the different existing approaches and their modeling), and one practical. The slides for the 2 parts are available (theory and practice), as well as the handon examples. Unlike the previous years, we have embraced new technologies to facilitate the public interaction with ULFM: enjoy the ULFM docker.

See you all in Salt Lake City, UT !!!

## Logarithmic Revoke Routine

Starting with ULFM-1.0, the implementation features a logarithmic revoke operation, with a logarithmically bound per-node communication degree. A paper presenting this implementation will be presented at EuroMPI’15.

The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods.

Purpose of the Revoke Operation

If the communication pattern of the application is complex, the occurrence of failures has the potential to deeply disturb the application and prevent an effective recovery from being implemented. Consider the example in the above figure: as long as no failure occurs, the processes are communicating in a point-to-point pattern (we decide to call plan A). Process Pk is waiting to receive a message from Pk-1, then sends a message to Pk+1 (when such processes exist). Let’s observe the effect of introducing a failure in plan A, and consider that P1 has failed. As only P2 communicates directly with P1, other processes do not detect this condition, and only P2 is informed of the failure of P1. The situation at P2 now raises a dilemma: P3 waits on P2, a non-failed process, therefore the operation must block until the matching send is posted at P2; however, P2 knows that P1 has failed, and that the application should branch into its recovery procedure plan B; if P2 were to switch abruptly to plan B, it would cease matching the receives P3 posted following plan A. At this point, P2 needs an effective way of interrupting operations that it does not intend to match anymore, otherwise, the application would reach a deadlock: the messages that P3 to Pn are waiting for will never arrive.

The proposed solution to resolve this scenario is that, before switching to plan B, the user code in P2 calls MPI_COMM_REVOKE, a new API which notifies all other processes in the communicator that a condition requiring recovery actions has been reached. Thanks to this flexibility, the cost associated with consistency in error reporting is paid only after an actual failure has happened, and only when necessary to the algorithm, and applications that do not need consistency, or in which the user can prove that the communication pattern remains safe, can enjoy better recovery performance.

When a process of the application calls MPI_COMM_REVOKE (similar operations exist for windows and files), all other alive processes in the communicator eventually receive a notification. The MPI_COMM_REVOKE call has an effect on the entire scope of the communicator, without requiring a collective or matching call at any participant. Instead, the effect of the Revoke operation is observed at other processes during non-matching MPI communication calls: when receiving this notification, any communication on the communicator (ongoing or future) is interrupted and a special error code returned. Then, all surviving processes can safely enter the recovery procedure of the application, knowing that no alive process belonging to that communicator will deadlock as a result.

The revocation notification needs to be propagated to all alive processes in the specified communicator, even when new failures happen during the Revoke propagation. Therefore, it is in essence a reliable broadcast. Among the four defining qualities of a reliable broadcast usually considered in the literature (Termination, Validity, Integrity, Agreement) [1], the non-uniform variants of the properties are sufficient, and the integrity criteria can be relaxed in the context of the Revoke algorithm. These simplified requirements are crucial for decreasing the cost of the Revoke operation, as the size of the messages and the number of message exchanges rounds can be drastically increased when one needs to implement an ordered, uniform reliable broadcast. Given the non-uniform agreement, the no-ordering, and loose integrity properties, in the Revoke reliable broadcast, a process that receives its first Revoke message can perform a single round of emissions to all its neighbors, with a constant message size, and then deliver the Revoke notification immediately, without further verification.

The last important aspect is the topology of the overlay network employed to perform the broadcast operation. In the reliable broadcast algorithm, when a process receives a broadcast message for the first time, it immediately broadcasts that same message to all its neighbors in the overlay graph. The agreement property can be guaranteed only when failures do not disconnect the overlay graph. In early prototype versions of the ULFM implementation, the reliable broadcast procedure employed a fully connected network (which guarantees that disconnected cliques never form). Obviously, this method scales poorly as, with the number or processes, the graph degree is linear, and the number of exchanged messages is quadratic. In practice, at scale, the large graph degree resulted in the application aborting due to resource exhaustion (too many open channels simultaneously, not enough memory for unexpected messages, etc.). Therefore, one needs to consider a more scalable overlay topology with a low graph degree that can yet maintain connectivity when nodes are suppressed.

Binomial Graph Overlay Topology

The Binomial Graph (BMG), introduced in [2], is a topology that features both opposing traits of a small degree, yet a strong resistance to the formation of disconnected cliques when nodes fail. A BMG is an undirected graph G=(V,E), where the vertices V represent a set of processes, and the edges E are a set of links forming an overlay network between these processes. Each vertex is given a unique identifier (i.e. the rank of the process). For each vertex v, there is a link to a set of vertices $$W={v+1, v+2, …, v+2^k~|~2^k < n}$$ Intuitively, a binomial graph can be seen as the union of all the binomial trees rooted at all vertices.

The BMG topology is proven to feature several desirable properties. It is a regular graph topology, in which all nodes have the same degree, even in graphs with unremarkable number of vertices. The degree is logarithmic with the number of nodes, therefore scalable. Meanwhile, it retains a small diameter and a small average distance (in number of hops) between any two nodes (also logarithmic). In addition, a binomial broadcast tree rooted at any node can be naturally extracted from a BMG.

Last, the BMG topology has a high node-connectivity —the minimum number of nodes whose removal can result in disconnecting the network—, which is equal to the degree D. As a consequence the BMG is D-1 node fault tolerant in all cases (an optimal result for a graph of degree D). The probability distribution for the formation of a disconnected graph when D or more failures happen is very favorable (the failures have to strike a particular set of nodes, in a variant of the generalized birthday problem).

Evaluation

Because of its asymmetrical nature, the impact of the Revoke call cannot be measured directly. At the initiator, the call only starts a non-synchronizing wave of token circulation, and measuring the very short duration of the call is not representative of the actual time required for the Revoke call to operate at all target processes. Measuring the time needed for a particular operation to be interrupted gives a better estimate of the propagation time of a Revoke notification. However, the overall impact remains underestimated if one doesn’t account for the fact that even after all processes have successfully delivered a Revoke notification, the reliable broadcast algorithm continues to emit and handle Revoke messages in the background for some time.

The benchmark we designed measures both the duration and the perturbation generated by the progress of a Revoke operation on the network. The benchmark comprises two communication plans. Plan A is a loop that performs a given collective operation on a communicator commA. At some iteration, an initiator process does not match the collective operation, but, instead, Revokes commA, which effectively ends plan A. Plan B is a similar loop performing the same collective operation in a duplicate communicator commB (thus the Revoke operation on commA does not interrupt operations in plan B). The first (and last) operation in plan A gives an estimate of the Revoke propagation time. Operations in plan B estimate the residual perturbation from echoing messages in the reliable broadcast.

The above figure presents the scalability of the AllReduce collective communication in the Revoke benchmark on the Darter Cray xc30 supercomputer (tuned collective module, uGNI transport). The first observation is that the performance of post-Revoke collective communications follows the same scalability trend as the pre-Revoke operations, even those impacted by jitter. Aside from the first post-Revoke AllReduce communication, which still exhibit a moderate overhead from jitter, the second post-Revoke AllReduce is only mildly impacted and the third AllReduce exhibit no significant difference from the failure free case, illustrating that the jitter introduced by the reliable broadcast algorithm has a low impact on this communication pattern. When the number of processes increases, the impact of jitter —the difference between the failure-free and the first post-Revoke operation— is almost constant (or slightly decreasing). If this trend were to continue at larger scales, the impact of jitter could become asymptotically negligible.

[1] V. Hadzilacos and S. Toueg. Fault-tolerant broadcasts and related problems. In S. Mullender, editor, Distributed systems (2nd Ed.), chapter 5, pages 97–145. ACM/Addison-Wesley, 1993.

[2] T. Angskun, G. Bosilca, and J. Dongarra. Binomial graph: A scalable and fault-tolerant logical network topology. In I. Stojmenovic, R. Thulasiram, L. Yang, W. Jia, M. Guo, and R. de Mello, editors, Parallel and Distributed Processing and Applications, volume 4742 of Lecture Notes in Computer Science, pages 471–482. Springer Berlin Heidelberg, 2007. Continue reading “Logarithmic Revoke Routine”

## Logarithmic Agreement Routine

Starting with ULFM-1.0, the implementation features a purely logarithmic agreement, with no single point of failure. A paper presenting this implementation will be presented during SC’15.

We considered a practical agreement algorithm with the following desired properties:

1. the unique decided value is the result of a combination of all values proposed by deciding processes (a major difference with a 1-set agreement),
2. failures consist of permanent crashes in a pseudo-synchronous system (no data corruption, loss of message, or malicious behaviors are considered),
3. the agreement favors the failure-free performance over the failure case, striving to exchange a logarithmic number of messages in the absence of failures.

To satisfy this last requirement, we introduced a practical, intermediate property, called Early Returning: that is the capacity of an early deciding algorithm to return before the stopping condition (early or not) is guaranteed: as soon as a process can determine that the decision value is fixed (except if it fails itself), the process is allowed to return. However, because the process is allowed to return early, later failures may compel that process to participate in additional communications.  Therefore, the decision must remain available after the processes return, in order to serve unexpected message exchanges until the stopping condition can be established. Unlike a regular early stopping algorithm, not all processes decide and stop at the same round, and some processes participate in more message exchanges — depending on the location of failed processes in the logarithmic communication topology.

Through synthetic benchmarks and real applications, we investigated the ERA costs, and highlighted the correspondence between the theoretical and practical logarithmic behavior of the proposed algorithm and the implementation. We have shown that using this algorithm, it is possible to design efficient applications, or domain specific fault mitigation building blocks that preserve the original application performance, while augmenting the capability of the applications to execute on future platforms where large scale may reduce reliability.

Performance of the Early Returning Agreement:

In the figure above, we present the scalability trend of ERA when no failures are disturbing the system.  We consider two different agreement implementations, 1) the known state-of-the-art 2-phase-commit Agreement algorithm presented in [1], called Log2phases, and 2) our best performing version of ERA. We also add, for reference, the performance of an Allreduce operation that in a failure-free context would have had the same outcome as the agreement. On the darter machine using one process per core, thus 16 processes per node, the average branching degree of non-leaf nodes is 2.125. The ERA and the Allreduce operations both exhibit a logarithmic trend when the number of nodes increase, as can be observed by the close fit (asymptotic standard error of $$0.6%$$) of the logarithmic function $$era(x)=6.7 \log_{2.125}(x)$$.

In contrast, the Log2phases algorithm exhibits a linear scaling with  the number of nodes, despite the expected theoretical bound proposed in [1]. Further evaluation show that some linear local computations during the agreement operation end up dominating the execution time, despite the logarithmic number of messages.

In the figure above, we compare the performance of different architecture-aware of ERA employing different tree topologies to perform the reduce/broadcast operations that are the basis of an ERA phase. In the flat binary tree, all ranks are organized in a binary tree, regardless of the hardware locality of ranks collocated on cores of the same node. In the hierarchical methods, one rank represents the node and participates in the inter-node binary tree; on each node, collocated ranks are all children of the representing rank in the bin/star method, or are organized along a node-local binary tree in the bin/bin method. The flat binary topology ERA and the Open MPI Allreduce are both hardware locality agnostic; their performance profiles are extremely similar. In contrast, the Cray Allreduce exhibits a better scalability thanks to accounting for the locality of ranks. Counterintuitively, the bin/star hierarchical topology performs worse than the flat binary tree: the representing rank for a node has 16 local children and the resulting 16 sequential memcpy operations (on the shared-memory device) take longer than the latency to cross the supplementary long-range links. In the bin/bin topology, most of these memory copies are parallelized and henceforth the overall algorithm scales logarithmically. When compared with the fully optimized, non fault tolerant Allreduce, the latency is doubled, which is a logical consequence of the need for the ERA operation to sequentialize a reduce and a broadcast that do not overlap (to ensure the consistent decision criterion in the failure case), while the Allreduce operation is spared that requirement and can overlap multiple interweaved reductions.

Performance of Applications using the Early Returning Agreement:

S3D is a highly parallel method-of-lines solver for partial differential equations and is used to perform first-principles-based direct numerical simulations of turbulent combustion. It employs high order explicit finite difference numerical schemes for spatial and temporal derivatives, and realistic physics for thermodynamics, molecular transport, and chemical kinetics.

Fenix is a framework aimed at enabling online (i.e., without disrupting the job) and transparent recovery from process, node, blade, and cabinet failures for parallel applications in an efficient and scalable manner. Fenix encapsulates mechanisms to transparently capture failures through ULFM return codes, re-spawn new processes on spare nodes when possible, fix failed communicators using ULFM capabilities, restore application state, and return the execution control back to the application.

We use S3D augmented with Fenix to test the effect of the new agreement algorithm on the recovery process when compared to the baseline agreement algorithm. The figures below shows the results of these experiments in terms of total absolute cost of each call to the shrink operation. The MPIX_COMM_SHRINK operation, which uses the agreement algorithm, has been identified in [2] as the most time consuming operation of the recovery process.