ULFM 2.0

ULFM 2.0 release

The ICL ULFM team is happy to announce ULFM 2.0 a new implementation of the MPI extension handling faults in sync with the current Open MPI master. Innumerable new features have been added both to Open MPI and to ULFM, we will focus on this announce on the ULFM ones. For information about what is new in Open MPI please read www.open-mpi.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crash (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error. This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM chapter standard draft document.

  • MPIX_ERR_PROC_FAILED when a process failure prevents the completion of an MPI operation.
  • MPIX_ERR_PROC_FAILED_PENDING when a potential sender matching a non-blocking wildcard source receive has failed.
  • MPIX_ERR_REVOKED when one of the ranks in the application has invoked the MPI_Comm_revoke operation on the communicator.
  • MPIX_Comm_revoke(MPI_Comm comm) Interrupts any communication pending on the communicator at all ranks.
  • MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm* newcomm) creates a new communicator where dead processes in comm were removed.
  • MPIX_Comm_agree(MPI_Comm comm, int *flag) performs a consensus (i.e. fault tolerant allreduce operation) on flag (with the operation bitwise or).
  • MPIX_Comm_failure_get_acked(MPI_Comm, MPI_Group*) obtains the group of currently acknowledged failed processes.
  • MPIX_Comm_failure_ack(MPI_Comm) acknowledges that the application intends to ignore the effect of currently known failures on wildcard receive completions and agreement return values.

Supported Systems

There are four main MPI network models available in Open MPI: “ob1”, “cm”,”yalla”, and “ucx”. Only “ob1” is adapted to support fault tolerance. “ob1” uses BTL (“Byte Transfer Layer”) components for each supported network. “ob1” supports a variety of networks that can be used in combination with each other. Collective operations (blocking and non-blocking) use an optimized implementation on top of “ob1”.

  • Loopback (send-to-self)
  • TCP
  • OpenFabrics: InfiniBand, iWARP, and RoCE
  • uGNI (Cray Gemini, Aries)
  • Shared memory Vader (FT supported w/CMA, XPmem, KNEM untested)
  • Tuned, and non-blocking collective communications

A full list of supported, untested and disabled components is provided below.

More Information

More information (tutorials, examples, build instructions for leading Top 500 systems) is also available in the Fault Tolerance ResearchHub website:https://fault-tolerance.org

Bibliographic References

If you are looking for, or want to cite a general reference for ULFM, please use

Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra: Post-failure recovery of MPI communication capability: Design and rationale. IJHPCA 27(3): 244-254 (2013). Available from: http://journals.sagepub.com/doi/10.1177/1094342013488238.

Building ULFM Open MPI

There are many available configure options (see ./configure --help for a full list); a summary of the more commonly used ones is included in the upstream Open MPI README file. The following paragraph gives a summary of ULFM Open MPI specific options behavior.

Configure options

  • --with-ft=TYPE Specify the type of fault tolerance to enable. Options: mpi (ULFM MPI draft standard). Fault tolerance support is enabled by default(as if --with-ft=mpi were implicitly present on the configure line).You may specify --without-ft to compile an almost stock Open MPI.
  • --with-platform=FILE Load configure options for the build from FILE. When --with-ft=mpi is set, the file contrib/platform/ft_mpi_ulfm is loaded by default. This file disables components that are known to not be able to sustain failures, or are insufficiently tested. You may edit this file and/or force back these options on the command line to enable these components.
  • --enable-mca-no-build=LIST Comma-separated list of pairs that will not be built. For example, --enable-mca-no-build=btl-portals,oob-ud will disable building the portals BTL and the ud OOB component. When --with-ft=mpi is set, this list is populated with the content of the aforementioned platform file. You may override the default list with this parameter.
  • --with-pmi and --with-slurm Force the building of SLURM scheduler support. Slurm with fault tolerance is tested. Do not use srun, otherwise your application gets killed by the RM upon the first failure. Instead, use mpirun in an salloc/sbatch.
  • --with-sge This is untested with fault tolerance.
  • --with-tm Force the building of PBS/Torque scheduler support. PBS is tested with fault tolerance. Use mpirun in a qsub allocation.
  • --disable-mpi-thread-multiple Disable the MPI thread level MPI_THREAD_MULTIPLE (it is enabled by default). Multiple threads with fault tolerance is lightly tested.
  • --disable-oshmem Disable building the OpenSHMEM implementation (by default, it is enabled). ULFM Fault Tolerance does not apply to OpenSHMEM.

Modified, Untested and Disabled Components

Frameworks and components which are not listed in the following list are unmodified and support fault tolerance. Listed frameworks may be modified(and work after a failure), untested (and work before a failure, but may malfunction after a failure), or disabled (they prevent the fault tolerant components from operating properly).

  • pml MPI point-to-point management layer
    • “ob1” modified to handle errors
    • “monitoring”, “v” unmodified, untested
    • “bfo”, “cm”, “crcpw”, “ucx”, “yalla” disabled
  • btl Point-to-point Byte Transfer Layer
    • “openib”, “tcp”, “vader(+cma)” modified to handle errors (removed unconditional abort on error, expect performance similar to upstream)
    • “portals4”, “scif”, “smcuda”, “usnic”, “vader(+knem,+xpmem)” unmodified, untested (may work properly, please report)
  • mtl Matching Transport Layer used for MPI point-to-point messages on some types of networks
    • All “mtl” components are disabled
  • coll MPI collective algorithms
    • “tuned”, “basic”, “nbc” modified to handle errors
    • “cuda”, “fca”, “hcoll”, “portals4” unmodified, untested (expect unspecified post-failure behavior)
  • osc MPI one-sided communications
    • Unmodified, untested (expect unspecified post-failure behavior)
  • io MPI I/O and dependent components
    • fs File system functions for MPI I/O
    • fbtl File byte transfer layer: abstraction for individual read/write operations for OMPIO
    • fcoll Collective read and write operations for MPI I/O
    • sharedfp Shared file pointer operations for MPI I/O
    • All components in these frameworks are unmodified, untested (expect clean post-failure abort)
  • vprotocol Checkpoint/Restart components
    • “pml-v”, “crcp” unmodified, untested
  • wait_sync Multithreaded wait-synchronization object
    • modified to handle errors (added a global interrupt to trigger all wait_sync objects)

Running ULFM Open MPI

Building your application

As ULFM is an extension to the MPI standard, you will need to #include "mpi-ext.h" in C, or use mpi_ext in Fortran to access the supplementary error codes and functions. Compile your application as usual, using the provided mpicc, mpif90, or mpicxx wrappers.

Running your application

You can launch your application with fault tolerance by simply using the provided mpiexec. Beware that your distribution may already provide a version of MPI, make sure to set your PATH and LD_LIBRARY_PATH properly. Note that fault tolerance is enabled by default in ULFM Open MPI; you can disable all fault tolerance capabilities by launching your application with mpiexec --disable-recovery.

Running under a batch scheduler

ULFM can operate under a job/batch scheduler, and is tested routinely with both PBS and Slurm. One difficulty comes from the fact that many job schedulers will “cleanup” the application as soon as a process fails. In order to avoid this problem, it is preferred that you use mpiexec within an allocation (e.g. salloc, sbatch, qsub) rather than a direct launch (e.g. srun).

Run-time tuning knobs

ULFM comes with a variety of knobs for controlling how it runs. The default parameters are sane and should result in good performance in most cases. You can change the default settings with --mca mpi_ft_foo.

  • orte_enable_recovery <true|false> (default: true) controls automatic cleanup of apps with failed processes within mpirun. The default differs from upstream Open MPI.
  • mpi_ft_enable <true|false> (default: true) permits turning off fault tolerance without recompiling. Failure detection is disabled. Interfaces defined by the fault tolerance extensions are substituted with corresponding non-fault tolerant implementations (e.g. MPI_Allreduce is substituted to MPIX_Comm_agree).
  • mpi_ft_verbose  (default: 0) increases the output of the fault tolerance activities. A value of 1 will report detected failures.
  • mpi_ft_detector  <true|false> (default: false) controls the activation of the OMPI level failure detector. When this detector if turned off, all failure detection is delegated to ORTE, which may be slow and/or incomplete.
  • mpi_ft_detector_thread <true|false> (default: false) controls the use of a thread to emit and receive failure detector’s heartbeats. Setting this value to “true” will also set MPI_THREAD_MULTIPLE support, which has a noticeable effect on latency. You may want to enable this option if you experience false positive processes incorrectly reported as failed.
  • mpi_ft_detector_period  (default: 1e-1 seconds) heartbeat period. Recommended value is 1/3 of the timeout. Values lower than 100us may impart a noticeable effect on latency (typically a 3us increase).
  • mpi_ft_detector_timeout  (default: 3e-1 seconds) heartbeat timeout (i.e. failure detection speed). Recommended value is 3 times the heartbeat period.

Known Limitations in ULFM-2.0-rc (2e75c73):

  • TOPO, FILE, RMA have little support for fault tolerance mechanisms.
  • There is a tradeoff between failure detection accuracy and performance. The current default is to favor performance. Users that experience detection accuracy issues may enable a more precise mode.
  • The failure detector operates on MPI_COMM_WORLD exclusively. Processes connected from MPI_COMM_CONNECT/ACCEPT and MPI_COMM_SPAWN may occasionally not be detected when they fail.
  • Failures during NBC collective may not be recovered properly in some cases.
  • Return of OpenIB credits spent toward a failed process can take several seconds. Until the btl_openib_ib_timeout and btl_openib_ib_retry_count controlled timeout triggers, lack of send credits may cause a temporary stall under certain communication patterns. Impacted users can try to adjust the value of these mca parameters.

Changelog

Release 2.0

Focus has been toward integration with current Open MPI master, performance, and stability.

  • ULFM is now based upon Open MPI master branch (#689f1be9). It will be regularly updated until eventually merged.
  • Fault Tolerance is enabled by default and is controlled with MCA variables.
  • Added support for multithreaded modes (MPI_THREAD_MULTIPLE, etc.)
  • Added support for non-blocking collective operations (NBC).
  • Added support for CMA shared memory transport (Vader).
  • Added support for advanced failure detection at the MPI level. Implements the algorithm described in “Failure detection and propagation in HPC systems.” https://doi.org/10.1109/SC.2016.26.
  • Removed the need for special handling of CID allocation.
  • Non-usable components are automatically removed from the build during configure
  • RMA, FILES, and TOPO components are enabled by default, and usage in a fault tolerant execution warns that they may cause undefined behavior after a failure.
  • Bugfixes:
    • Code cleanup and performance cleanup in non-FT builds; –without-ft at configure time gives an almost stock Open MPI.
    • Code cleanup and performance cleanup in FT builds with FT runtime disabled; –mca ft_enable_mpi false thoroughly disables FT runtime activities.
    • Some error cases would return ERR_PENDING instead of ERR_PROC_FAILED in collective operations.
    • Some test could set ERR_PENDING or ERR_PROC_FAILED instead of ERR_PROC_FAILED_PENDING for ANY_SOURCE receptions.

Release 1.1

Focus has been toward improving stability, feature coverage for intercomms, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

  • Forked from Open MPI 1.5.5 devel branch
  • Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations.
  • Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard blessed -extension- names MPIX_ERR_xxx.
  • Support for Intercommunicators:
    • Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators.
    • MPI_COMM_REVOKE tested on intercommunicators.
  • Disabled completely (.ompi_ignore) many untested components.
  • Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
  • Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
  • Bugfixes:
    • SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
    • SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
    • Revoked send operations are now always completed or remote cancelled and may not deadlock anymore.
    • Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
    • Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Release 1.0

Focus has been toward improving performance, both before and after the occurrence of failures. The list of new features includes:

  • Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
  • Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
  • New algorithm to perform agreements, with a truly logarithmic complexity in number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK.
  • New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks.
  • Improved support for our traditional network layer:
    • TCP: fully tested
    • SM: fully tested (with the exception of XPMEM, which remains unsupported)
  • Added support for High Performance networks
    • Open IB: reasonably tested
    • uGNI: reasonably tested
  • The tuned collective module is now enabled by default (reasonably tested), expect a huge performance boost compared to the former basic default setting
    • Back-ported PBS/ALPS fixes from Open MPI
    • Back-ported OpenIB bug/performance fixes from Open MPI
    • Improve Context ID allocation algorithm to reduce overheads of Shrink
    • Miscellaneous bug fixes

Version Numbers and Binary Compatibility

Starting from ULFM Open MPI version 2.0, ULFM Open MPI is binary compatible with the corresponding Open MPI master branch and compatible releases (see the binary compatibility and version number section in the upstream Open MPI README). That is, applications compiled with a compatible Open MPI can run with the ULFM Open MPI mpirun and MPI libraries. Conversely, as long as the application does not employ one of the MPIX functions, which are exclusively defined in ULFM Open MPI, an application compiled with ULFM Open MPI can be launched with a compatible Open MPI mpirun and run with the non-fault tolerant MPI library.

Contacting the Authors

Found a bug? Got a question? Want to make a suggestion? Want to contribute to ULFM Open MPI? Working on a cool use-case? Please let us know! The best way to report bugs, send comments, or ask questions is to sign up on the user’s mailing list: ulfm+subscribe@googlegroups.com Because of spam, only subscribers are allowed to post to these lists (ensure that you subscribe with and post from exactly the same e-mail address — joe@example.com is considered different thanjoe@mycomputer.example.com!). Visit these pages to subscribe to the lists: https://groups.google.com/forum/#!forum/ulfm When submitting questions and problems, be sure to include as much information as possible. This web page details all the information that we request in order to provide assistance: http://www.open-mpi.org/community/help/

Copyright

Running on Edison

ULFM configuration for NERSC Edison

Edison is a Cray XC30, with a peak performance of 2.57 petaflops/sec, 133,824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk. Hosted at NERSC, it offers access to many researchers and practitioners in the US.

Installing the latest versions of Open MPI (including the development, unstable and stable) on this machine can easily be done by using the provided platform file. The same cannot unfortunately be said about the current ULFM version. This version was based on an older unstable version of Open MPI (1.7) and didnt backported all the new exciting features from the development branch of Open MPI. As a result, getting it to work on Edison is a little challenging, but we all like a little challenge.

Preparing for installation

This step is unfortunately required, as I couldnt find any other way to update the m4 file of our ancient ULFM installation, without dedicating too much time to such a hopeless task (especially as the new ULFM 2.0 is reaching almost stable status). The issue is triggered by a misconfiguration of autoconf 2.69 on Edison, which breaks all the autotool chain. The issue can be easily corrected by downloading the Mercurial version of ULFM on another computer, running autogen.sh, a quick configure with no arguments, followed by “make dist”. This will generate 2 archives (openmpi-1.7.1ULFM.tar.bz2 and openmpi-1.7.1ULFM.tar.gz). Use the one corresponding to your preferred compression algorithm, and move it on Edison.

Compiling from source

With the archive moved on Edison, you can now undergo the last step to configure and compile ULFM there. Lets assume we want to build an optimized version, integrated with Edison resource management software (Slurm), using uGNI, without debugging information. One of the interesting features ULFM inherited from Open MPI is the capability of using platform files. Here is the platform file for Edison, it will eventually make its way into the official ULFM release, meanwhile you can copy and paste directly from here.

With this platform file saved as platform.edisonyou can now proceed to configure ULFM using “./configure —with-platform=platform.edison …”. There are many other possible configuration parameters, and you are strongly encouraged to read the output of configure helpto gain more insight into the flexible configuration system. Meanwhile here is what I usually add in addition to the platform file “—prefix=… –enable-mpirun-prefix-by-default” (please replace the dots with your prefered installation directory). Keep in mind that the installation directory must be in your $PATH, to have direct access to the wrapper compilers and to the mpiexec.

Preparing for execution

In general we suggest you to use a specific MCA file (the -am parameter) to provide the parameters for your ULFM runs (and not collide with the normal Open MPI runs). As long as you use mpiexec to start your application this will work. If instead you want to rely on srun to so do, you will have to dump the context of your AMCA file directly into your main ${HOME}/.openmpi/mca-params.conf file (and pay attention when you execute normal Open MPI runs).

Running ULFM applications in an interactive job

Congratulations, the hardest steps are now done and you are now supposed to have a fully compiled and ready to run version of ULFM, and its binaries in your $PATH. Allocating an interactive job on Edison is well described in the online documentation. There is however a trick, you need to add an option to prevent Slurm from destroying your job when one of the processes disappear. This option is no-kill.

Thus, allocating an interactive job with 4 processes for a short duration can be achieved with salloc -p debug -N 4 --no-kill`.
Running an application then become easy, as indicated in ULFM setup steps.

Survival readiness

You code has now a sane execution environment that is able to survive faults, and can help your application achieve the same goal.

SC’ 16 paper: Failure Detection and Propagation in HPC System

We have presented our paper on the scalable failure detection, “Failure Detection and Propagation in HPC System” at SC’16.

Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds.
The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior.
The propagation stage is using a non-uniform variant of a reliable broadcast over a circulant graph overlay network, and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the ORNL Titan supercomputer, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm.

The slides presented by Yves are here and the paper is also available.

Download (PDF, 638KB)


This failure detector is implemented in the ULFM 2.0, an effort that will soon become open to public.

SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial was divided in two parts, one theoretical (covering the different existing approaches and their modeling), and one practical. The slides for the 2 parts are available (theory and practice), as well as the handon examples. Unlike the previous years, we have embraced new technologies to facilitate the public interaction with ULFM: enjoy the ULFM docker.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Salt Lake City, UT !!!

ULFM-1.1 Release

ULFM has reached the 1.1 milestone, a minor release, crushing few bugs identified by our users and developers.

Focus has been toward improving stability, feature coverage for intercommunicators, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

  • Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations.
  • Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard blessed – extension- names MPIX_ERR_xxx.
  • Support for Intercommunicators:
    • Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators.
    • MPI_COMM_REVOKE tested on intercommunicators.
  • Disabled completely (.ompi_ignore) many untested components
  • Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
  • Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
  • Bugfixes:
    • SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
    • SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
  • Revoked send operations are now always completed or remote cancelled and may not deadlock anymore.
  • Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
  • Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Get the source and happy hacking,
The ULFM team

SC’15 tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

Get the slides part1, part2, and the examples

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Austin, TX !!!

Tutorial @ SC’14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE

Reliability is one of the major concerns when envisioning the future Exascale platforms. The IESP projects an increase in node performance and node concurrency by one or two orders of magnitude, which translates, even under the most optimistic perspectives, in a mechanical decrease of the mean time to interruption (MTTI) of at least one order of magnitude. Because of this tendency, platform providers, software implementors, and high-performance application users who target capability runs on such machines cannot regard the occurrence of interruption due to a failure as a rare dramatic event, but must consider faults inevitable and take them into account by integrating some form of fault-tolerance.

One easy way to get ready is to join us at SC’14 in New Orleans for a tutorial on fault tolerance, a middle-ground between theoretical understanding and practical knowledge. This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high performance systems. The main goal is to provide the attendees with a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithm or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault tolerant techniques with User Level Fault Mitigation (a fault tolerant MPI extension recently proposed to the MPI forum). Relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and ABFT techniques in a hands-on session.

In preparation for the hand-on session one needs to get ready by installing ULFM and setting up the paths to access it. You can either follow the post or the steps below.
1. Download the version prepared for this tutorial.
2. Untar it in some convenient location.

3. Go inside the newly untarred directory and launch

4. Configure Open MPI. You should change the –prefix in the following command, and make sure that –enable-mpi-ext=ftmpi –with-ft=mpi is specified.

5. Compile, link and install

6. Add the directory provided as a –prefix in Step 4 to your PATH and LD_LIBRARY_PATH. As an example for bash one can add the following two lines to ${HOME}/.bashrc

7. You’re almost ready for the hand-on session.
8. The archive with the examples and skeletons is available here.

Let’s rock the faults!

The slides used during this tutorial are available here, here and here.

Uniform Intercomm Creation

A question about uniformly creating an inter-communicator using MPI_Intercomm_create has been posted on the ULFM mailing list. Initially, I though it is an easy corner-case, that can be solved with few barriers and/or agreements. It turns out this issue is more complicated that initially expected, with few twists on the way. Let me detail our adventure toward writing a uniform intercomm creation function.

Before moving further, let’s clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough info to start talking about (page 262 line 6):

This call [MPI_Intercomm_create] creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.

In other words, if you provide two intra-communicators, a leader on each one and a bridge communicator where the leaders can talk together, you will be able to bind the two groups of processes corresponding to each of the intra-communicators into a inter-communicator. Neat! Graphically speaking this should look like
MPI_Intercomm_create

So far so good, but what “uniformly” means? Based on some web resources uniformly is about consistency, in this particular instance about the fact that all participants will return the same answer (aka. the some inter-communicator or the same error code) despite all odds. Here we are looking specifically from the point of view of process failures, more precisely a single failure. Obviously, the trivial approach where one agrees on the resulting intercomm doesn’t work, as due to process failures some processes might not have an intercomm. Clearly, we need a different, better approach.

While talking about this, Aurelien proposed the following solution, entirely based on MPI_Barrier.

Looks quite simple and straightforward. Does it do reach the expected consensus at the end? Unfortunately not, it doesn’t work in all cases. Imagine that all peers in one side succeeded to create the intercomm while everyone in the remote communicator failed. Then on one side the interexists is TRUE while on the other side it is FALSE. Thus the local barrier (line 3) succeed on the side with interexists equal to TRUE and the intercomm is not revoked. As the intercomm exists, these processes will then go in the second MPI_Barrier (line 10), on the “half-created” intercomm, and will deadlock as there will never be a remote process to participate to the barrier.

What we are missing here is an agreement between the two sides (A and B) that things went well for everyone. So let’s try to use agreement to reach this goal. In the remaining of this post I will use the function AGREE, which is a point-to-point agreement between two processes (the two leaders). We need this particular function, as we cannot call MPI_Comm_agree on the bridge communicator, simply because the two leaders are not the only participants in the bridge communicator. Fortunately, implementing an agreement between two processes is almost a trivial task, so I will consider the AGREE function as existing.

If there is no failure the code works, but it’s an overkill. Now, if we have one failure during the execution there are several cases.
1. The failure happens before the MPI_Intercomm_create. Some processes will have an intercomm, and some will not. Thus the local agreement will make everyone on the same group (A and B) agree about the local existence of the intercomm. The AGREE between the leaders will then propagate this knowledge to the remote group leader. At this point the two leaders have the correct answer, and then the MPI_Comm_agree at line 8 will finally distribute this answer on the two groups. This case works!
2. The failure happen after the MPI_Intercomm_create. One can think about this case as a false positive (because the intercomm exists on all alive processes), but the real question is if we are able to detect it and reach a consensus. The MPI_Comm_agree at line 3 will make the flag TRUE on both groups. Note that on one side the MPI_Comm_agree will return MPI_ERR_PROC_FAILED as required by the MPI standard, but we do not take in account the return code here.

Here is the reasoning leading to this form of the algorithm.
1. The MPI_Comm_agree on line 3 is a little tricky. We use the returned value of the flag (bitwise AND operation between all living processes), but we willingly choose to ignore the return value. What really matters is that all alive participants have the resulting intercomm, and this will be reflected in the value of flag after the call. All other cases are trapped either by partially existing intercomm, or by the REVOKE call. Moreover, it is extremely important that the resulting value of flag is computed through an agreement and not through a traditional collective communication (which lack the consistency factor).
2. For a similar reason the MPI_Comm_agree on line 9 cannot be replaced by a MPI_Broadcast, even as its meaning is to broadcast the information collected at the local root into the local group of processes. Without the use of MPI_Comm_agree here one cannot distinguish between every process still alive has an intercomm with known faults or only some participants have an intercomm (with faults). Thus, as the intercomm can be revoked only if all alive processes know about it, we have to enforce this consistent view by the use of the agreement.

While this inter-communicator creation looks like an extremely complicated and expensive matter, one should keep in mind that such communicator creation are not supposed to be in the critical path, so they might not be as prohibitive as one might think. If Everything Was Easy Nothing Would Be Worth It!