
SC’20 Tutorial

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’20. This year, the tutorial will be split into two sessions over two days. Before attending the tutorial, please download the following material; you will need it during the tutorial to follow both the theoretical and the practical parts.

  1. Monday the 9th material
    1. slides Introduction and Methods for Fault Tolerance
    2. slides Failures and Silent Errors
    3. slides Checkpointing: the Young/Daly formula
    4. slides In-memory & Hierarchical Checkpointing, Replication
    5. slides Silent Errors
  2. Tuesday the 10th material
    1. slides Handling Errors in MPI Applications
    2. slides Recovering MPI Application
    3. Examples
    4. docker ULFM Fault Tolerant MPI on your desktop

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, ranging from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction [1, 2, 3, 4, 5] covers different existing resilience approaches, including modeling, and solutions such as checkpoint/restart (C/R), buddy checkpointing, and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions [1, 2] introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM Docker image.

Enjoy our promotional video 😉


See you all online!


Spurious Errors: lack of MPI Progress and Failure Detection

A very common mishap while developing parallel applications is to assume that an application running at small scale automatically translates into successful large-scale runs. Such optimistic views are largely unproven in general, but tools exist to help with the validation process. However, adding resilience to a parallel application has a tendency to increase the likelihood of consistency errors, and unfortunately no tools currently exist to help through this process. In some cases, a few common-sense practices can save hours of debugging and improve the quality of the parallel application.

As you work on your MPI fault-tolerant application, you discover it runs fine at small scale and with small data sets, but when you increase the number of processes or the computational load on the participating processes, spurious faults seem to be ‘injected’ for no good reason. It might be easy to blame the underlying libraries, but before we go there, it is possible you are observing an inter-operability issue between the different layers of the resilient software stack. More precisely, you may be observing the effect of the lack of MPI progress on the failure detector within the MPI library.

MPI Progress (and lack thereof)

The MPI Standard does not mandate that asynchronous progress shall always happen. When you post a nonblocking communication, such as MPI_Irecv or MPI_Isend, you may be surprised to discover that, depending on the MPI library configuration and/or the network transport in use, no progress at all may have happened before you enter the corresponding completion function (e.g., MPI_Wait) or another MPI call. This is especially true for non-blocking collectives, for which asynchronous progress is a rarity, even with network hardware that would otherwise be capable of asynchronous progress.

The root cause is an optimization that helps obtain good results on raw latency/injection rate (e.g., in performance benchmarks), but is not an optimal choice for applications with long swaths of computational code that do not perform any MPI calls (and hence do not trigger the progress engine of MPI). Achieving asynchronous progress for all operations often requires a progress thread, internal to the MPI implementation, which entails that most MPI calls have to be thread-safe, even if the user initializes with MPI_THREAD_SINGLE. It is also more difficult to implement. Hence, most MPI implementations default to a ‘user-call-progressed’ mode, that is, message transmission and reception management are both progressed when the user enters one of the ‘progressing’ MPI routines (e.g., MPI_Wait, MPI_Test, MPI_Recv, etc.). Upon entering such a routine, any ongoing MPI operation may progress, including operations that are not referenced by the call.

For the same reasons that asynchronous progress is not the default behavior in Open MPI (latency performance optimization), the failure detector in ULFM is also call-progress based.

An example

Let’s look at the following code example to illustrate the progress issue.
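The code listing from the original post is not reproduced here; the sketch below is a minimal, illustrative reconstruction of the pattern it describes (message size, compute duration, and names are assumptions): rank 1 posts a large non-blocking receive, computes for a while without calling MPI, then waits for the message and reports the two durations.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define COUNT (16*1024*1024)   /* illustrative "large" message: 16M doubles */

    /* Stand-in for a long stretch of application work with no MPI calls. */
    static void compute(double seconds) {
        double start = MPI_Wtime();
        while (MPI_Wtime() - start < seconds) { /* spin */ }
    }

    int main(int argc, char *argv[]) {
        int rank;
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = calloc(COUNT, sizeof(double));

        if (rank == 0) {
            MPI_Send(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Request req;
            double t0, t_compute, t_wait;
            MPI_Irecv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            t0 = MPI_Wtime();
            compute(5.0);                       /* no MPI progress happens here */
            t_compute = MPI_Wtime() - t0;
            t0 = MPI_Wtime();
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the transfer may only progress now */
            t_wait = MPI_Wtime() - t0;
            printf("compute: %.2fs  wait: %.2fs\n", t_compute, t_wait);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }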

Without Fault Tolerance

The program will report how much of the time was overlapped between the compute operation and the reception of the large message. Note that this is the easy case: a contiguous receive. Hence, most high performance networks will achieve overlap. If we run on one of the networks that does not perform asynchronous progress, we may still observe that the majority (or the totality) of the communication time is visible in the duration of the MPI_Wait function. In essence, the lack of progress can cause disappointing performance with respect to communication/computation overlap.

With Fault Tolerance

With fault tolerance enabled, however, the same run does not complete.

Our test program has aborted because the MPI library reported that a process has failed. This is manifestly a spurious error, raised despite the fact that no process has failed. The root cause lies in the fact that, since the application stopped performing MPI calls for longer than the detection timeout, there was no opportunity to progress the MPI library, either to progress asynchronous communications (the aforementioned performance issue) or to progress the internal failure detector. Due to this lack of progress, the times at which heartbeat messages are emitted and their reception is handled can drift from the failure detector's expectations, and that drift can grow until a process appears unresponsive for long enough to be reported as dead (hence the occurrence of a spurious error).

Tuning the Detector for your Use Case

Most communication-intensive applications will do just fine with the default settings for the failure detector. These settings favor latency and injection-rate metrics while choosing sane defaults that avoid triggering spurious faults in most communicating applications and minimize the performance impact on the most common communication patterns. It is however important to note that when spurious errors do get reported, it may be a simple matter of tuning the runtime parameters, or your application code, to fix the error.

Solution 1: tune the detector

The detector in ULFM is highly configurable. You can change both the timeout, and the use of an internal progress thread within the MPI library that will ensure that the detector remains active when the application does not perform MPI progress. You can set the following mpirun parameters to control the detector:

  • -mca mpi_ft_detector_timeout number: number of seconds before the detector reports an error from an unresponsive process (floating point value). If you experience spurious errors, this can be a first step at weeding them out. The advantage of this approach is that it will leave micro-benchmarks performance unchanged; however, the timeout for detection of actual failures will be increased. Given that failures are somewhat rare, this can often be a beneficial trade-off.
  • -mca mpi_ft_detector_thread [true/false]: defaults to false (to preserve micro-benchmark performance). When set to true, the detector executes in an MPI-library-internal thread, thus avoiding the dependence on user progress. The performance trade-off comes from the fact that this flag acts as if MPI_Init had been called with MPI_THREAD_MULTIPLE, which has a mild impact on latency/injection rate. This is often a good trade-off for real applications.
  • -mca mpi_ft_detector [true/false]: can be used to turn off the detector altogether. This approach can work with some networks that provide in-band error reporting (e.g., TCP), but the failure detection timeout is bounded by external parameters (TCP timeouts can reach into hours under certain circumstances), and some networks do not provide in-band detection, which can result in deadlocks in failure scenarios.
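For instance, an illustrative launch that raises the detection timeout to 30 seconds (the flag name is the one listed above; the process count, value, and program name are placeholders) could look like:

    mpirun -np 16 --mca mpi_ft_detector_timeout 30 ./my_ft_app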

Solution 2: progress MPI

A different approach is to dedicate some resources to progressing MPI operations. A traditional way to achieve continuous progress from user code is to initialize MPI with MPI_THREAD_MULTIPLE and dedicate a thread to MPI progress by parking that thread in an MPI_Recv for the entire duration of the program. This receive should never be matched by a send, except at the very end of the program, in order to release the progress thread and let it terminate.
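The sketch below illustrates this parking-receive pattern; it is an assumption-laden illustration (the reserved tag, the duplicated communicator, and the self-directed wake-up send are choices made for this example), not a drop-in recipe.

    #include <mpi.h>
    #include <pthread.h>

    #define PROGRESS_TAG 4242      /* arbitrary tag reserved for the parking receive */
    static MPI_Comm progress_comm; /* duplicated communicator used only by the helper */
    static int my_rank;

    /* Helper thread: blocks in MPI_Recv so the MPI progress engine runs
     * continuously; the matching send is posted only at shutdown. */
    static void *progress_fn(void *arg) {
        int dummy;
        (void)arg;
        MPI_Recv(&dummy, 1, MPI_INT, my_rank, PROGRESS_TAG,
                 progress_comm, MPI_STATUS_IGNORE);
        return NULL;
    }

    int main(int argc, char *argv[]) {
        int provided, dummy = 0;
        pthread_t tid;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_dup(MPI_COMM_WORLD, &progress_comm);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        pthread_create(&tid, NULL, progress_fn, NULL);

        /* ... application computation and communication ... */

        /* Shutdown: post the matching send to release the progress thread. */
        MPI_Send(&dummy, 1, MPI_INT, my_rank, PROGRESS_TAG, progress_comm);
        pthread_join(tid, NULL);
        MPI_Comm_free(&progress_comm);
        MPI_Finalize();
        return 0;
    }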

This technique lets the user dedicate one of her own threads to ensure that the MPI progress engine is active at all times; it avoids spurious errors and often improves performance in applications that are sensitive to communication/computation overlap (especially when using complex communication routines, like non-contiguous datatypes and non-blocking collective operations, which cannot be easily delegated to the network hardware). Note however that the raw latency/bandwidth in micro-benchmarks can be impacted, and one has to dedicate a CPU core to this background activity.

Conclusions

The lack of calls progressing the MPI library is a known adverse situation in HPC applications. In the fault-free world, this can translate into the application significantly under-performing when users expect communication/computation overlap. This issue can compound during fault-tolerant operation by causing drifts in the failure detector, resulting in spurious error reporting (i.e., the notification of failures about processes that have not failed). Fortunately, the ULFM implementation provides multiple parameters that permit circumventing this issue.


ULFM 4.0.2u1

The ICL ULFM team is happy to announce ULFM v4.0.2u1, a new implementation of the MPI extension handling faults, in sync with the current Open MPI release (v4.0.2). Innumerable new features have been added both to Open MPI and to ULFM; in this announcement we will focus on the ULFM ones. For information about what is new in Open MPI 4.0.2, read the changelog.

This is a stability and upstream parity upgrade, moving ULFM from version 4.0.1 of Open MPI to the latest stable (v4.0.2, October 2019 #cb5f4e737a, ulfm #0e249ca1). It improves stability, performance and facilitates future release tracking with the stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error. This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM chapter of the standard draft document, plus one additional API, MPIX_Comm_is_revoked.

  • MPIX_ERR_PROC_FAILED when a process failure prevents the completion of an MPI operation.
  • MPIX_ERR_PROC_FAILED_PENDING when a potential sender matching a non-blocking wildcard source receive has failed.
  • MPIX_ERR_REVOKED when one of the ranks in the application has invoked the MPI_Comm_revoke operation on the communicator.
  • MPIX_Comm_revoke(MPI_Comm comm) Interrupts any communication pending on the communicator at all ranks.
  • MPIX_Comm_is_revoked(MPI_Comm comm, int *flag) Query if a communicator is currently revoked.
  • MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm* newcomm) creates a new communicator where dead processes in comm were removed.
  • MPIX_Comm_agree(MPI_Comm comm, int *flag) performs a consensus (i.e. a fault tolerant allreduce operation) on flag (with the operation bitwise AND).
  • MPIX_Comm_failure_get_acked(MPI_Comm, MPI_Group*) obtains the group of currently acknowledged failed processes.
  • MPIX_Comm_failure_ack(MPI_Comm) acknowledges that the application intends to ignore the effect of currently known failures on wildcard receive completions and agreement return values.
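As an illustration of how these interfaces fit together, here is a hedged sketch of a common recovery pattern (revoke the communicator, then shrink it, then let the application roll back to its own checkpoint). The helper name is hypothetical, error checking is elided, and the calling code is assumed to have set MPI_ERRORS_RETURN on the communicator.

    #include <mpi.h>
    #include <mpi-ext.h>

    /* Hypothetical helper: called when an MPI operation on '*comm' returned rc.
     * Replaces the communicator with one that excludes the failed processes. */
    int repair_communicator(MPI_Comm *comm, int rc)
    {
        MPI_Comm newcomm;

        if (rc != MPIX_ERR_PROC_FAILED && rc != MPIX_ERR_REVOKED)
            return rc;                     /* not a failure: let the caller handle it */

        MPIX_Comm_revoke(*comm);           /* interrupt pending operations everywhere */
        MPIX_Comm_shrink(*comm, &newcomm); /* rebuild without the dead processes */
        MPI_Comm_free(comm);
        *comm = newcomm;
        MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);
        return MPI_SUCCESS;                /* caller rolls back to its last checkpoint */
    }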

Supported Systems

There are four main MPI network models available in Open MPI: “ob1”, “cm”, “yalla”, and “ucx”. Only “ob1” currently supports the features required for fault tolerance. “ob1” uses BTL (“Byte Transfer Layer”) components for each supported network and supports a variety of networks that can be used in combination with each other. Collective operations (blocking and non-blocking) use an optimized implementation on top of “ob1”.

  • Loopback (send-to-self)
  • UCT (beta)
  • TCP
  • OpenFabrics: InfiniBand, iWARP, and RoCE
  • uGNI (Cray Gemini, Aries)
  • Shared memory Vader (FT supported w/CMA, XPmem, KNEM untested)
  • Tuned, and non-blocking collective communications

A full list of supported, untested and disabled components is provided below.

More Information

More information (tutorials, examples, build instructions for leading Top 500 systems) is also available on the Fault Tolerance Research Hub website: https://fault-tolerance.org

Bibliographic References

If you are looking for, or want to cite a general reference for ULFM, please use

Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra: Post-failure recovery of MPI communication capability: Design and rationale. IJHPCA 27(3): 244-254 (2013). Available from: http://journals.sagepub.com/doi/10.1177/1094342013488238.

Building ULFM Open MPI

There are many available configure options (see ./configure --help for a full list); a summary of the more commonly used ones is included in the upstream Open MPI README file. The following paragraph gives a summary of ULFM Open MPI specific options behavior.

Configure options

  • --with-ft=TYPE Specify the type of fault tolerance to enable. Options: mpi (ULFM MPI draft standard). Fault tolerance support is enabled by default (as if --with-ft=mpi were implicitly present on the configure line). You may specify --without-ft to compile an almost stock Open MPI.
  • --with-platform=FILE Load configure options for the build from FILE. When --with-ft=mpi is set, the file contrib/platform/ft_mpi_ulfm is loaded by default. This file disables components that are known to not be able to sustain failures, or are insufficiently tested. You may edit this file and/or force back these options on the command line to enable these components.
  • --enable-mca-no-build=LIST Comma-separated list of framework-component pairs that will not be built. For example, --enable-mca-no-build=btl-portals,oob-ud will disable building the portals BTL and the ud OOB component. When --with-ft=mpi is set, this list is populated with the content of the aforementioned platform file. You may override the default list with this parameter.
  • --with-pmi and --with-slurm Force the building of SLURM scheduler support. Slurm with fault tolerance is tested. Do not use srun, otherwise your application gets killed by the RM upon the first failure. Instead, use mpirun in an salloc/sbatch.
  • --with-sge This is untested with fault tolerance.
  • --with-tm Force the building of PBS/Torque scheduler support. PBS is tested with fault tolerance. Use mpirun in a qsub allocation.
  • --disable-mpi-thread-multiple Disable the MPI thread level MPI_THREAD_MULTIPLE (it is enabled by default). Multiple threads with fault tolerance is lightly tested.
  • --disable-oshmem Disable building the OpenSHMEM implementation (by default, it is enabled). ULFM Fault Tolerance does not apply to OpenSHMEM.

Modified, Untested and Disabled Components

Frameworks and components which are not listed in the following list are unmodified and support fault tolerance. Listed frameworks may be modified (and work after a failure), untested (and work before a failure, but may malfunction after a failure), or disabled (they prevent the fault tolerant components from operating properly).

  • pml MPI point-to-point management layer
    • “ob1” modified to handle errors
    • “monitoring”, “v” unmodified, untested
    • “bfo”, “cm”, “crcpw”, “ucx”, “yalla” disabled
  • btl Point-to-point Byte Transfer Layer
    • “openib”, “tcp”, “vader(+cma)”, “uct” modified to handle errors (removed unconditional abort on error, expect performance similar to upstream)
    • “portals4”, “scif”, “smcuda”, “usnic”, “vader(+knem,+xpmem)” unmodified, untested (may work properly, please report)
  • mtl Matching Transport Layer used for MPI point-to-point messages on some types of networks
    • All “mtl” components are disabled
  • coll MPI collective algorithms
    • “tuned”, “basic”, “nbc” modified to handle errors
    • “cuda”, “fca”, “hcoll”, “portals4” unmodified, untested (expect unspecified post-failure behavior)
  • osc MPI one-sided communications
    • Unmodified, untested (expect unspecified post-failure behavior)
  • io MPI I/O and dependent components
    • fs File system functions for MPI I/O
    • fbtl File byte transfer layer: abstraction for individual read/write operations for OMPIO
    • fcoll Collective read and write operations for MPI I/O
    • sharedfp Shared file pointer operations for MPI I/O
    • All components in these frameworks are unmodified, untested (expect clean post-failure abort)
  • vprotocol Checkpoint/Restart components
    • “pml-v”, “crcp” unmodified, untested
  • wait_sync Multithreaded wait-synchronization object
    • modified to handle errors (added a global interrupt to trigger all wait_sync objects)

Running ULFM Open MPI

Building your application

As ULFM is an extension to the MPI standard, you will need to #include "mpi-ext.h" in C, or use mpi_ext in Fortran to access the supplementary error codes and functions. Compile your application as usual, using the provided mpicc, mpif90, or mpicxx wrappers.

Running your application

You can launch your application with fault tolerance by simply using the provided mpiexec. Beware that your distribution may already provide a version of MPI; make sure to set your PATH and LD_LIBRARY_PATH properly. Note that fault tolerance is enabled by default in ULFM Open MPI; you can disable all fault tolerance capabilities by launching your application with mpiexec --disable-recovery.

Running under a batch scheduler

ULFM can operate under a job/batch scheduler, and is tested routinely with both PBS and Slurm. One difficulty comes from the fact that many job schedulers will “cleanup” the application as soon as a process fails. In order to avoid this problem, it is preferred that you use mpiexec within an allocation (e.g. salloc, sbatch, qsub) rather than a direct launch (e.g. srun).

Run-time tuning knobs

ULFM comes with a variety of knobs for controlling how it runs. The default parameters are sane and should result in good performance in most cases. You can change the default settings with --mca mpi_ft_foo.

  • orte_enable_recovery [true|false] (default: true) controls automatic cleanup of apps with failed processes within mpirun. The default differs from upstream Open MPI.
  • mpi_ft_enable [true|false] (default: true) permits turning off fault tolerance without recompiling. Failure detection is disabled. Interfaces defined by the fault tolerance extensions are substituted with corresponding non-fault tolerant implementations (e.g. MPIX_Comm_agree falls back to an MPI_Allreduce).
  • mpi_ft_verbose (default: 0) increases the output of the fault tolerance activities. A value of 1 will report detected failures.
  • mpi_ft_detector [true|false] (default: false) controls the activation of the OMPI-level failure detector. When this detector is turned off, all failure detection is delegated to ORTE, which may be slow and/or incomplete.
  • mpi_ft_detector_thread [true|false] (default: false) controls the use of a thread to emit and receive the failure detector’s heartbeats. Setting this value to “true” will also enable MPI_THREAD_MULTIPLE support, which has a noticeable effect on latency. You may want to enable this option if you experience false positives, i.e., processes incorrectly reported as failed.
  • mpi_ft_detector_period (default: 1e-1 seconds) heartbeat period. Recommended value is 1/3 of the timeout. Values lower than 100us may impart a noticeable effect on latency (typically a 3us increase).
  • mpi_ft_detector_timeout (default: 3e-1 seconds) heartbeat timeout (i.e. failure detection speed). Recommended value is 3 times the heartbeat period.

Known Limitations in ULFM v4.0.1u1:

  • Infiniband support is provided through the OpenIB or UCT BTL, fault tolerant operation over the UCX PML is not yet supported.
  • TOPO, FILE, RMA have little support for fault tolerance mechanisms.
  • There is a tradeoff between failure detection accuracy and performance. The current default is to favor performance. Users that experience detection accuracy issues may enable a more precise mode.
  • The failure detector operates on MPI_COMM_WORLD exclusively. Processes connected from MPI_COMM_CONNECT/ACCEPT and MPI_COMM_SPAWN may occasionally not be detected when they fail.
  • Failures during NBC collective may not be recovered properly in some cases.
  • Return of OpenIB credits spent toward a failed process can take several seconds. Until the btl_openib_ib_timeout and btl_openib_ib_retry_count controlled timeout triggers, lack of send credits may cause a temporary stall under certain communication patterns. Impacted users can try to adjust the value of these mca parameters.

Changelog

Release 4.0.2u1

This is a stability and upstream parity upgrade. It is based on the most
current Open MPI Release (v4.0.2, October 2019).

New features

  • This release is based on Open MPI release v4.0.2 (ompi #cb5f4e737a).
  • This release is based on ULFM master (ulfm #0e249ca1).
  • Support for the UCT BTL enters beta stage.

Bugfixes

  • High sensitivity to noise in the failure detector.
  • Deadlocks when revoking while BTL progress threads are updating messages.
  • A case where the failure detector would keep observing a dead process forever.
  • Disable the use of external pmix/libevent by default (the internals are modified to handle error cases).
  • Clean error paths leaving some rdma registration dangling.
  • Do not remove the orte job/proc session dir prematurely upon error.

Release 4.0.1ulfm2.1rc1

New features

  • Addition of the MPI_Comm_is_revoked function
  • Renamed ftbasic collective component to ftagree
  • Restored the pcollreq extension

Bugfixes

  • Failures of node-local siblings were not always detected
  • Failure propagation and detection was slowed down by trying to notify known dead processes
  • There were deadlocks in multithreaded programs
  • There were issues with PMPI when compiling Fortran Interfaces
  • There were deadlocks on OS-X

Release 2.0

Focus has been toward integration with current Open MPI master, performance, and stability.

  • ULFM is now based upon Open MPI master branch (#689f1be9). It will be regularly updated until eventually merged.
  • Fault Tolerance is enabled by default and is controlled with MCA variables.
  • Added support for multithreaded modes (MPI_THREAD_MULTIPLE, etc.)
  • Added support for non-blocking collective operations (NBC).
  • Added support for CMA shared memory transport (Vader).
  • Added support for advanced failure detection at the MPI level. Implements the algorithm described in “Failure detection and propagation in HPC systems.” https://doi.org/10.1109/SC.2016.26.
  • Removed the need for special handling of CID allocation.
  • Non-usable components are automatically removed from the build during configure
  • RMA, FILES, and TOPO components are enabled by default, and usage in a fault tolerant execution warns that they may cause undefined behavior after a failure.
  • Bugfixes:
    • Code cleanup and performance cleanup in non-FT builds; --without-ft at configure time gives an almost stock Open MPI.
    • Code cleanup and performance cleanup in FT builds with FT runtime disabled; --mca ft_enable_mpi false thoroughly disables FT runtime activities.
    • Some error cases would return ERR_PENDING instead of ERR_PROC_FAILED in collective operations.
    • Some test could set ERR_PENDING or ERR_PROC_FAILED instead of ERR_PROC_FAILED_PENDING for ANY_SOURCE receptions.

Release 1.1

Focus has been toward improving stability, feature coverage for intercomms, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

  • Forked from Open MPI 1.5.5 devel branch
  • Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations.
  • Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard-blessed extension names MPIX_ERR_xxx.
  • Support for Intercommunicators:
    • Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators.
    • MPI_COMM_REVOKE tested on intercommunicators.
  • Disabled completely (.ompi_ignore) many untested components.
  • Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
  • Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
  • Bugfixes:
    • SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
    • SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
    • Revoked send operations are now always completed or remote cancelled and may not deadlock anymore.
    • Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
    • Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Release 1.0

Focus has been toward improving performance, both before and after the occurrence of failures. The list of new features includes:

  • Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
  • Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
  • New algorithm to perform agreements, with a truly logarithmic complexity in number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK.
  • New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks.
  • Improved support for our traditional network layer:
    • TCP: fully tested
    • SM: fully tested (with the exception of XPMEM, which remains unsupported)
  • Added support for High Performance networks
    • Open IB: reasonably tested
    • uGNI: reasonably tested
  • The tuned collective module is now enabled by default (reasonably tested); expect a huge performance boost compared to the former basic default setting
  • Back-ported PBS/ALPS fixes from Open MPI
  • Back-ported OpenIB bug/performance fixes from Open MPI
  • Improved the Context ID allocation algorithm to reduce the overheads of Shrink
  • Miscellaneous bug fixes

Version Numbers and Binary Compatibility

Starting from ULFM Open MPI version 2.0, ULFM Open MPI is binary compatible with the corresponding Open MPI master branch and compatible releases (see the binary compatibility and version number section in the upstream Open MPI README). That is, applications compiled with a compatible Open MPI can run with the ULFM Open MPI mpirun and MPI libraries. Conversely, as long as the application does not employ one of the MPIX functions, which are exclusively defined in ULFM Open MPI, an application compiled with ULFM Open MPI can be launched with a compatible Open MPI mpirun and run with the non-fault tolerant MPI library.

Contacting the Authors

Found a bug? Got a question? Want to make a suggestion? Want to contribute to ULFM Open MPI? Working on a cool use-case? Please let us know! The best way to report bugs, send comments, or ask questions is to sign up on the users' mailing list: ulfm+subscribe@googlegroups.com. Because of spam, only subscribers are allowed to post to these lists (ensure that you subscribe with, and post from, exactly the same e-mail address; joe@example.com is considered different than joe@mycomputer.example.com!). Visit this page to subscribe to the list: https://groups.google.com/forum/#!forum/ulfm. When submitting questions and problems, be sure to include as much information as possible. This web page details all the information that we request in order to provide assistance: http://www.open-mpi.org/community/help/



SC’19 Tutorial

Before attending the tutorial, please download the following material; you will need it during the tutorial to follow both the theoretical and the practical parts.

  1. Theoretical Session slides
  2. Practical Session slides
  3. Examples
  4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’19. The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, ranging from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling, and solutions such as checkpoint/restart (C/R), buddy checkpointing, and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM Docker image.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Denver !!!


Simplifying the ACK/GET_ACKED couple

In this post, we will discuss two alternative proposals to simplify the MPIX_COMM_FAILURE_ACK and MPIX_COMM_FAILURE_GET_ACKED functions of the User Level Failure Mitigation MPI specification draft.

These functions are particularly useful for users whose applications need to introspect process failures and resume point-to-point communication capabilities independently from other MPI ranks. As such, they are instrumental in reducing the cost of recovery in applications with mostly decoupled communication patterns, such as master-worker, partial differential equation solvers, asynchronous methods, etc.

Dealing with wildcard receptions

These two functions serve two complementary purposes in the ULFM specification. The first purpose is to identify failed processes independently from other MPI ranks, while the second is to restore the ability to post receptions from MPI_ANY_SOURCE.

Consider the code in Example 1 (sketched below). If any process has failed, the reception must be interrupted, as it otherwise risks deadlocking if the actual sender has crashed.

Any-source reception must be interrupted to let the user code avert the deadlock risk if the expected sender process has died.
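Example 1 itself is not reproduced on this page; the following is a minimal sketch of the situation it illustrates (the function name, tag, and error-handler setup are assumptions made for this sketch).

    #include <mpi.h>
    #include <mpi-ext.h>

    /* Example 1 (sketch): a wildcard receive that a failure must interrupt.
     * Assumes MPI_ERRORS_RETURN has been set on comm. */
    int receive_any(MPI_Comm comm, int *buf)
    {
        int rc = MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm,
                          MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            /* A potential sender may have died: without this interruption the
             * receive could block forever. The error is MPIX_ERR_PROC_FAILED
             * (or MPIX_ERR_PROC_FAILED_PENDING for the non-blocking form). */
        }
        return rc;
    }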

The crux of the issue now becomes how to restore the capability of using wildcard receptions again. In single-threaded programs, multiple simple options would be available. For example, the implementation could report a process failure only once and then automatically re-enable wildcard receptions. This approach however becomes rife with race conditions in MPI codes that perform wildcard receptions from multiple threads. In order to give control to users, re-enabling wildcard receptions must be performed by an explicit call that can be adequately protected with mutual exclusion between threads, when required.

Example 2 (sketched below) illustrates how to employ the MPIX_COMM_FAILURE_ACK function to explicitly re-enable the reception of ANY_SOURCE messages. After that call, operations that would otherwise return an error because of the potential failure of a sender process can resume. The list of processes whose failures have been acknowledged can then be obtained (without side effects, possibly from another thread) by using MPIX_COMM_FAILURE_GET_ACKED, which produces the group of failed processes as of the time of the ACK call.

A combination of ACK/GET_ACKED lets the application resume any-source receptions.
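The Example 2 listing is likewise not reproduced; the sketch below (with an assumed tag and a deliberately simplified retry loop) shows the ACK/GET_ACKED sequence using the interface names of the current ULFM implementation.

    #include <mpi.h>
    #include <mpi-ext.h>

    /* Example 2 (sketch): after a wildcard receive is interrupted by a failure,
     * acknowledge the known failures to re-enable any-source receptions. */
    int receive_any_ft(MPI_Comm comm, int *buf)
    {
        int rc;
        MPI_Group failed;

        do {
            rc = MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm,
                          MPI_STATUS_IGNORE);
            if (rc != MPI_SUCCESS) {
                /* Re-enable ANY_SOURCE by acknowledging the failures seen so far. */
                MPIX_Comm_failure_ack(comm);
                /* Optionally (and possibly from another thread), inspect which
                 * processes were acknowledged at the time of the ack. */
                MPIX_Comm_failure_get_acked(comm, &failed);
                /* ... if the expected sender is in 'failed', give up instead
                 *     of retrying ... */
                MPI_Group_free(&failed);
            }
        } while (rc != MPI_SUCCESS);
        return rc;
    }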

What we like, what we don’t like

The good

The ACK/GET_ACKED pair achieves the goal of controlling the re-enabling of any-source receptions, even in multithreaded programs, with a sane meaning that enables user code to reason and synchronize around it. Even better, one can resume any-source receptions without ever producing the group of failed processes, or mandating the use of data structures linear in the number of ranks in each communicator.

The Bad

However, the interface can be confusing for unsuspecting users.

First, MPIX_COMM_FAILURE_ACK mutates the state in a way that is not known a priori. When one calls that function, it has the effect of acknowledging all failures detected thus far (as seen from the implementation's perspective). A user can know what has been acknowledged only after the fact, by using the sister call MPIX_COMM_FAILURE_GET_ACKED.

Second, it is not possible to obtain the list of failed processes without side effects (i.e., acknowledging failures) on the considered communicator, which is undesirable in some common usage patterns that have been seen deployed by users of ULFM.

The proposed evolution, with two different twists

We want to add two new abilities, while at the same time making the interface more intuitive (i.e., reading the function name should be enough to understand what it does).

Reading the group of failed processes without side effects

The first addition we want to make is a way to peek at the list of failed processes without changing what is acknowledged (or not). This can be achieved in a simple enough way by adding the following function:

MPIX_Comm_get_failed(comm, &fgrp)
MPI_Comm comm;
MPI_Group* fgrp;
  • This function obtains the current (local, non-synchronized) list of known failed processes.
  • Repeated calls to the function return a monotonically increasing group (i.e., the group produced by an earlier call is always a prefix of the group produced during the current call).
  • Two calls may produce different groups even if no MPI errors have been reported between the two calls.
  • If a communication has raised a Process Failure error, the process that caused the error to be raised is in the produced group in a later call to MPIX_Comm_get_failed.

So, this is simple enough. If a process needs to know what processes are known to have failed thus far, without incurring the cost of synchronizing that knowledge with other ranks, that’s what you get from this function. Note that this means another rank calling the function “at the same time” may get a different group (if that’s not your use case, MPIX_Comm_agree or MPIX_Comm_shrink do synchronize fault knowledge).

You also get to peek at everything that the MPI implementation knows. That is, using this function, you may know that a process is dead before any error has been raised in a communication operation.
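A small sketch of this peek-without-side-effects usage, written against the proposed (not yet existing) interface described above; the helper name is made up for this illustration.

    #include <stdio.h>
    #include <mpi.h>
    #include <mpi-ext.h>

    /* Sketch: check how many processes this rank currently knows to have
     * failed, without acknowledging anything and without synchronizing. */
    void report_known_failures(MPI_Comm comm)
    {
        MPI_Group failed;
        int nfailed;

        MPIX_Comm_get_failed(comm, &failed);  /* local knowledge only */
        MPI_Group_size(failed, &nfailed);
        if (nfailed > 0)
            printf("locally aware of %d failed process(es)\n", nfailed);
        MPI_Group_free(&failed);
    }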

The groups produced by a sequence of calls are monotonic, that is, the result of any previous call to MPIX_Comm_get_failed is a prefix to the current call. This will be important when we go to the next topic: controlling what we acknowledge.

Acknowledging only what we already know about, and knowing what we have acknowledged.

Now that we can produce a group of failed processes, let’s use that knowledge to control how we acknowledge failed processes (and thus control the error behavior of any-source receptions). Both variants also rely on the previously described MPIX_Comm_get_failed.

Variant 1: counting and the acknowledged group

MPIX_Comm_failure_ack(comm, nack)
MPI_Comm comm;
int nack;

MPIX_Comm_failure_get_acked(comm, &agrp)
MPI_Comm comm;
MPI_Group* agrp;

The first function, MPIX_Comm_failure_ack, lets the user code acknowledge failures (and re-enable any-source receptions), but only for the given maximum number of processes nack. That is, if a code looks at fgrp and is satisfied that this is a correctable condition and any-source receptions shall resume, passing the nack obtained from MPI_GROUP_SIZE(fgrp, &nack) prevents acknowledging not-yet-considered failures that may have been detected in the meantime. In other words, we now have a way to know in advance what we are acknowledging.

Note that if nack is smaller than what has been acknowledged in a previous call, the call is a no-op. Conversely, if nack is larger than nfailed, the size of the fgrp that would be produced by a concurrent MPIX_Comm_get_failed call, then only nfailed processes are acknowledged.

The second function, MPIX_Comm_failure_get_acked, obtains the group of currently acknowledged faulty processes.

In simple cases (for example, when nack <= nfailed), one can simply reason about how many members of fgrp have been acknowledged. However, if nack > nfailed, or when multiple software modules or threads are issuing MPIX_Comm_failure_ack, one may want to see what is currently acknowledged without having to rely on shared variables.

Thus, MPIX_Comm_failure_get_acked produces agrp, the group of currently acknowledged failed processes. As long as no new calls to MPIX_Comm_failure_ack are issued on comm (possibly from another software module or thread), the function keeps producing the same group.

Variant 2: counting only

The first interface variant achieves almost all of the goals we have set. The only limitation is that one cannot simply count how many processes have been acknowledged without producing the group agrp. In some use cases (e.g., Monte Carlo), the only important criterion is how many workers are left, not which workers are still operating, hence producing agrp can be seen as pure overhead.

A solution to that problem can be achieved by changing the acknowledging function so that it returns how many members in fgrp have actually been acknowledged by the call.

MPIX_Comm_failure_ack_failed(comm, nack, &nacked)
MPI_Comm comm;
int nack;
int* nacked;

This function does the same as variant 1 above, but in addition it produces in nacked the size of the agrp group that would be produced by a later call to MPIX_Comm_failure_get_acked. This permits counting the number of acknowledged (and thus failed) processes without ever building fgrp or agrp.
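A hedged sketch of the counting-only pattern (the Monte-Carlo style use case mentioned above), again written against the proposed interfaces of this post rather than an existing API:

    #include <limits.h>
    #include <mpi.h>
    #include <mpi-ext.h>

    /* Sketch: unconditionally acknowledge every currently known failure,
     * re-enabling any-source receptions, and learn how many workers are
     * gone without ever building fgrp or agrp. */
    int count_lost_workers(MPI_Comm comm)
    {
        int nacked;

        /* An nack larger than the number of known failures acknowledges them
         * all; nacked reports how many are acknowledged in total. */
        MPIX_Comm_failure_ack_failed(comm, INT_MAX, &nacked);
        return nacked;
    }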

More Example use cases

  • Code that does not use any-source receptions can simply look at the group of failed processes without incurring side effects.
  • Resuming any-source unconditionally, a common usage pattern in Monte-Carlo codes; the counter-based variant 2 can also count the number of failed processes for free.
  • In the counting variant (variant 2), one can obtain the acked group without using the additional function MPIX_Comm_failure_get_acked (useful if MPIX_Comm_get_failed has been called before).
  • Comparison between variant 1, variant 2, and the old ACK/GET_ACKED pair.

ULFM 2.1rc1 Open MPI v4.0.1 based

The ICL ULFM team is happy to announce ULFM v4.0.1ulfm2.1rc1, a new implementation of the MPI extension handling faults, in sync with the current Open MPI release (v4.0.1). Innumerable new features have been added both to Open MPI and to ULFM; in this announcement we will focus on the ULFM ones. For information about what is new in Open MPI 4.0.1, read the changelog.

This is a stability and upstream parity upgrade, moving ULFM from an old unreleased version of Open MPI to the latest stable (v4.0.1, May 2019 #b780667). It improves stability, performance and facilitates future release tracking with the stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error. This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM chapter of the standard draft document, plus one additional API, MPIX_Comm_is_revoked.

  • MPIX_ERR_PROC_FAILED when a process failure prevents the completion of an MPI operation.
  • MPIX_ERR_PROC_FAILED_PENDING when a potential sender matching a non-blocking wildcard source receive has failed.
  • MPIX_ERR_REVOKED when one of the ranks in the application has invoked the MPI_Comm_revoke operation on the communicator.
  • MPIX_Comm_revoke(MPI_Comm comm) Interrupts any communication pending on the communicator at all ranks.
  • MPIX_Comm_is_revoked(MPI_Comm comm, int *flag) Query if a communicator is currently revoked.
  • MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm* newcomm) creates a new communicator where dead processes in comm were removed.
  • MPIX_Comm_agree(MPI_Comm comm, int *flag) performs a consensus (i.e. a fault tolerant allreduce operation) on flag (with the operation bitwise AND).
  • MPIX_Comm_failure_get_acked(MPI_Comm, MPI_Group*) obtains the group of currently acknowledged failed processes.
  • MPIX_Comm_failure_ack(MPI_Comm) acknowledges that the application intends to ignore the effect of currently known failures on wildcard receive completions and agreement return values.

Supported Systems

There are four main MPI network models available in Open MPI: “ob1”, “cm”, “yalla”, and “ucx”. Only “ob1” currently supports the features required for fault tolerance. “ob1” uses BTL (“Byte Transfer Layer”) components for each supported network and supports a variety of networks that can be used in combination with each other. Collective operations (blocking and non-blocking) use an optimized implementation on top of “ob1”.

  • Loopback (send-to-self)
  • TCP
  • OpenFabrics: InfiniBand, iWARP, and RoCE
  • uGNI (Cray Gemini, Aries)
  • Shared memory Vader (FT supported w/CMA, XPmem, KNEM untested)
  • Tuned, and non-blocking collective communications

A full list of supported, untested and disabled components is provided below.

More Information

More information (tutorials, examples, build instructions for leading Top 500 systems) is also available on the Fault Tolerance Research Hub website: https://fault-tolerance.org

Bibliographic References

If you are looking for, or want to cite a general reference for ULFM, please use

Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra: Post-failure recovery of MPI communication capability: Design and rationale. IJHPCA 27(3): 244-254 (2013). Available from: http://journals.sagepub.com/doi/10.1177/1094342013488238.

Building ULFM Open MPI

There are many available configure options (see ./configure --help for a full list); a summary of the more commonly used ones is included in the upstream Open MPI README file. The following paragraph gives a summary of ULFM Open MPI specific options behavior.

Configure options

  • --with-ft=TYPE Specify the type of fault tolerance to enable. Options: mpi (ULFM MPI draft standard). Fault tolerance support is enabled by default (as if --with-ft=mpi were implicitly present on the configure line). You may specify --without-ft to compile an almost stock Open MPI.
  • --with-platform=FILE Load configure options for the build from FILE. When --with-ft=mpi is set, the file contrib/platform/ft_mpi_ulfm is loaded by default. This file disables components that are known to not be able to sustain failures, or are insufficiently tested. You may edit this file and/or force back these options on the command line to enable these components.
  • --enable-mca-no-build=LIST Comma-separated list of framework-component pairs that will not be built. For example, --enable-mca-no-build=btl-portals,oob-ud will disable building the portals BTL and the ud OOB component. When --with-ft=mpi is set, this list is populated with the content of the aforementioned platform file. You may override the default list with this parameter.
  • --with-pmi and --with-slurm Force the building of SLURM scheduler support. Slurm with fault tolerance is tested. Do not use srun, otherwise your application gets killed by the RM upon the first failure. Instead, use mpirun in an salloc/sbatch.
  • --with-sge This is untested with fault tolerance.
  • --with-tm Force the building of PBS/Torque scheduler support. PBS is tested with fault tolerance. Use mpirun in a qsub allocation.
  • --disable-mpi-thread-multiple Disable the MPI thread level MPI_THREAD_MULTIPLE (it is enabled by default). Multiple threads with fault tolerance is lightly tested.
  • --disable-oshmem Disable building the OpenSHMEM implementation (by default, it is enabled). ULFM Fault Tolerance does not apply to OpenSHMEM.

Modified, Untested and Disabled Components

Frameworks and components which are not listed in the following list are unmodified and support fault tolerance. Listed frameworks may be modified (and work after a failure), untested (and work before a failure, but may malfunction after a failure), or disabled (they prevent the fault tolerant components from operating properly).

  • pml MPI point-to-point management layer
    • “ob1” modified to handle errors
    • “monitoring”, “v” unmodified, untested
    • “bfo”, “cm”, “crcpw”, “ucx”, “yalla” disabled
  • btl Point-to-point Byte Transfer Layer
    • “openib”, “tcp”, “vader(+cma)” modified to handle errors (removed unconditional abort on error, expect performance similar to upstream)
    • “portals4”, “scif”, “smcuda”, “usnic”, “vader(+knem,+xpmem)” unmodified, untested (may work properly, please report)
  • mtl Matching Transport Layer used for MPI point-to-point messages on some types of networks
    • All “mtl” components are disabled
  • coll MPI collective algorithms
    • “tuned”, “basic”, “nbc” modified to handle errors
    • “cuda”, “fca”, “hcoll”, “portals4” unmodified, untested (expect unspecified post-failure behavior)
  • osc MPI one-sided communications
    • Unmodified, untested (expect unspecified post-failure behavior)
  • io MPI I/O and dependent components
    • fs File system functions for MPI I/O
    • fbtl File byte transfer layer: abstraction for individual read/write operations for OMPIO
    • fcoll Collective read and write operations for MPI I/O
    • sharedfp Shared file pointer operations for MPI I/O
    • All components in these frameworks are unmodified, untested (expect clean post-failure abort)
  • vprotocol Checkpoint/Restart components
    • “pml-v”, “crcp” unmodified, untested
  • wait_sync Multithreaded wait-synchronization object
    • modified to handle errors (added a global interrupt to trigger all wait_sync objects)

Running ULFM Open MPI

Building your application

As ULFM is an extension to the MPI standard, you will need to #include "mpi-ext.h" in C, or use mpi_ext in Fortran to access the supplementary error codes and functions. Compile your application as usual, using the provided mpicc, mpif90, or mpicxx wrappers.

Running your application

You can launch your application with fault tolerance by simply using the provided mpiexec. Beware that your distribution may already provide a version of MPI; make sure to set your PATH and LD_LIBRARY_PATH properly. Note that fault tolerance is enabled by default in ULFM Open MPI; you can disable all fault tolerance capabilities by launching your application with mpiexec --disable-recovery.

Running under a batch scheduler

ULFM can operate under a job/batch scheduler, and is tested routinely with both PBS and Slurm. One difficulty comes from the fact that many job schedulers will “cleanup” the application as soon as a process fails. In order to avoid this problem, it is preferred that you use mpiexec within an allocation (e.g. salloc, sbatch, qsub) rather than a direct launch (e.g. srun).

Run-time tuning knobs

ULFM comes with a variety of knobs for controlling how it runs. The default parameters are sane and should result in good performance in most cases. You can change the default settings with --mca mpi_ft_foo.

  • orte_enable_recovery [true|false] (default: true) controls automatic cleanup of apps with failed processes within mpirun. The default differs from upstream Open MPI.
  • mpi_ft_enable [true|false] (default: true) permits turning off fault tolerance without recompiling. Failure detection is disabled. Interfaces defined by the fault tolerance extensions are substituted with corresponding non-fault tolerant implementations (e.g. MPIX_Comm_agree falls back to an MPI_Allreduce).
  • mpi_ft_verbose (default: 0) increases the output of the fault tolerance activities. A value of 1 will report detected failures.
  • mpi_ft_detector [true|false] (default: false) controls the activation of the OMPI-level failure detector. When this detector is turned off, all failure detection is delegated to ORTE, which may be slow and/or incomplete.
  • mpi_ft_detector_thread [true|false] (default: false) controls the use of a thread to emit and receive the failure detector’s heartbeats. Setting this value to “true” will also enable MPI_THREAD_MULTIPLE support, which has a noticeable effect on latency. You may want to enable this option if you experience false positives, i.e., processes incorrectly reported as failed.
  • mpi_ft_detector_period (default: 1e-1 seconds) heartbeat period. Recommended value is 1/3 of the timeout. Values lower than 100us may impart a noticeable effect on latency (typically a 3us increase).
  • mpi_ft_detector_timeout (default: 3e-1 seconds) heartbeat timeout (i.e. failure detection speed). Recommended value is 3 times the heartbeat period.

Known Limitations in ULFM-2.0-rc (2e75c73):

  • TOPO, FILE, RMA have little support for fault tolerance mechanisms.
  • There is a tradeoff between failure detection accuracy and performance. The current default is to favor performance. Users that experience detection accuracy issues may enable a more precise mode.
  • The failure detector operates on MPI_COMM_WORLD exclusively. Processes connected from MPI_COMM_CONNECT/ACCEPT and MPI_COMM_SPAWN may occasionally not be detected when they fail.
  • Failures during NBC collective may not be recovered properly in some cases.
  • Return of OpenIB credits spent toward a failed process can take several seconds. Until the btl_openib_ib_timeout and btl_openib_ib_retry_count controlled timeout triggers, lack of send credits may cause a temporary stall under certain communication patterns. Impacted users can try to adjust the value of these mca parameters.

Changelog

Release 4.0.1ulfm2.1rc1

New features

  • Addition of the MPI_Comm_is_revoked function
  • Renamed ftbasic collective component to ftagree
  • Restored the pcollreq extension

Bugfixes

  • Failures of node-local siblings were not always detected
  • Failure propagation and detection was slowed down by trying to notify known dead processes
  • There were deadlocks in multithreaded programs
  • There were issues with PMPI when compiling Fortran Interfaces
  • There were deadlocks on OS-X

Release 2.0

Focus has been toward integration with current Open MPI master, performance, and stability.

  • ULFM is now based upon Open MPI master branch (#689f1be9). It will be regularly updated until eventually merged.
  • Fault Tolerance is enabled by default and is controlled with MCA variables.
  • Added support for multithreaded modes (MPI_THREAD_MULTIPLE, etc.)
  • Added support for non-blocking collective operations (NBC).
  • Added support for CMA shared memory transport (Vader).
  • Added support for advanced failure detection at the MPI level. Implements the algorithm described in “Failure detection and propagation in HPC systems.” https://doi.org/10.1109/SC.2016.26.
  • Removed the need for special handling of CID allocation.
  • Non-usable components are automatically removed from the build during configure
  • RMA, FILES, and TOPO components are enabled by default, and usage in a fault tolerant execution warns that they may cause undefined behavior after a failure.
  • Bugfixes:
    • Code cleanup and performance cleanup in non-FT builds; --without-ft at configure time gives an almost stock Open MPI.
    • Code cleanup and performance cleanup in FT builds with FT runtime disabled; --mca ft_enable_mpi false thoroughly disables FT runtime activities.
    • Some error cases would return ERR_PENDING instead of ERR_PROC_FAILED in collective operations.
    • Some test could set ERR_PENDING or ERR_PROC_FAILED instead of ERR_PROC_FAILED_PENDING for ANY_SOURCE receptions.

Release 1.1

Focus has been toward improving stability, feature coverage for intercomms, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

  • Forked from Open MPI 1.5.5 devel branch
  • Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations.
  • Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard-blessed extension names MPIX_ERR_xxx.
  • Support for Intercommunicators:
    • Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators.
    • MPI_COMM_REVOKE tested on intercommunicators.
  • Disabled completely (.ompi_ignore) many untested components.
  • Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
  • Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
  • Bugfixes:
    • SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
    • SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
    • Revoked send operations are now always completed or remote cancelled and may not deadlock anymore.
    • Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
    • Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Release 1.0

Focus has been toward improving performance, both before and after the occurrence of failures. The list of new features includes:

  • Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
  • Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
  • New algorithm to perform agreements, with a truly logarithmic complexity in number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK.
  • New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks.
  • Improved support for our traditional network layer:
    • TCP: fully tested
    • SM: fully tested (with the exception of XPMEM, which remains unsupported)
  • Added support for High Performance networks
    • Open IB: reasonably tested
    • uGNI: reasonably tested
  • The tuned collective module is now enabled by default (reasonably tested); expect a huge performance boost compared to the former basic default setting
  • Back-ported PBS/ALPS fixes from Open MPI
  • Back-ported OpenIB bug/performance fixes from Open MPI
  • Improved the Context ID allocation algorithm to reduce the overheads of Shrink
  • Miscellaneous bug fixes

Version Numbers and Binary Compatibility

Starting from ULFM Open MPI version 2.0, ULFM Open MPI is binary compatible with the corresponding Open MPI master branch and compatible releases (see the binary compatibility and version number section in the upstream Open MPI README). That is, applications compiled with a compatible Open MPI can run with the ULFM Open MPI mpirun and MPI libraries. Conversely, as long as the application does not employ one of the MPIX functions, which are exclusively defined in ULFM Open MPI, an application compiled with ULFM Open MPI can be launched with a compatible Open MPI mpirun and run with the non-fault tolerant MPI library.

Contacting the Authors

Found a bug? Got a question? Want to make a suggestion? Want to contribute to ULFM Open MPI? Working on a cool use-case? Please let us know! The best way to report bugs, send comments, or ask questions is to sign up on the user’s mailing list: ulfm+subscribe@googlegroups.com. Because of spam, only subscribers are allowed to post to these lists (ensure that you subscribe with and post from exactly the same e-mail address — joe@example.com is considered different than joe@mycomputer.example.com!). Visit these pages to subscribe to the lists: https://groups.google.com/forum/#!forum/ulfm. When submitting questions and problems, be sure to include as much information as possible. This web page details all the information that we request in order to provide assistance: http://www.open-mpi.org/community/help/

Copyright

Categories
ULFM User Level Failure Mitigation

SC’18 Tutorial

Before attending the tutorial, please download the following material; you will need it during the tutorial to follow the theoretical and practical parts.

  1. Theoretical Session slides
  2. Practical Session slides
  3. Examples
  4. ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’18 (somewhat similar to past incarnations). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitioners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithmic-Based Fault Tolerance (ABFT). The practical session introduces a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created an ULFM docker.

More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Texas !!!

Categories
ULFM User Level Failure Mitigation

ULFM 2.1a1 Docker Package

There are many ways to install ULFM. For performance evaluation, large-scale experiments, or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick, small, non-performance-critical test, one might prefer to spend time working on the concepts rather than on the installation. Thus, we provide a docker image for those who want to quickly test its capabilities. The 2.1a1 Docker image is a bugfix release compared to the previous 2.0rc version.

Using the Docker Image

  1. Install Docker
    • Docker can be seen as a “lightweight” virtual machine.
    • Docker is available for a wide range of systems (MacOS, Windows, Linux).
    • You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get docker-io, port install docker-io, etc.)
  2. In a terminal, run docker run hello-world to verify that the Docker installation works.
  3. Load the pre-compiled ULFM Docker machine into your Docker installation: docker pull abouteiller/mpi-ft-ulfm
  4. Source the docker aliases in a terminal; this will redirect the “make” and “mpirun” commands in the local shell to execute in the Docker machine.
    1. alias make='docker run -v $PWD:/sandbox:Z abouteiller/mpi-ft-ulfm make'
       alias mpirun='docker run -v $PWD:/sandbox:Z abouteiller/mpi-ft-ulfm mpirun --oversubscribe -mca btl tcp,self'
  5. Run some examples to see how this works. Quick examples can be found in the tutorial examples directory. You can now type make to compile the examples using the Docker-provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun -np 10 example.

Have fun!

Categories
User Level Failure Mitigation

ULFM 2.0rc Docker package

There are many ways to install ULFM. For large-scale experiments or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick, small, non-performance-critical test, one might prefer to spend time working on the concepts rather than on the installation. Thus, we provide a docker image for those who want to quickly test its capabilities.

Using the Docker Image

  1. Install Docker
    • Docker can be seen as a “lightweight” virtual machine.
    • Docker is available for a wide range of systems (MacOS, Windows, Linux).
    • You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get docker-io, port install docker-io, etc.)
  2. In a terminal, run docker run hello-world to verify that the Docker installation works.
  3. Load the pre-compiled ULFM Docker machine into your Docker installation: docker pull abouteiller/mpi-ft-ulfm
  4. Source the docker aliases in a terminal; this will redirect the “make” and “mpirun” commands in the local shell to execute in the Docker machine.
  5. Run some examples to see how this works. Quick examples can be found in the tutorial examples directory. You can now type make to compile the examples using the Docker-provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun -np 10 example.

Have fun!

Categories
User Level Failure Mitigation

ULFM 2.0

The ICL ULFM team is happy to announce ULFM 2.0, a new implementation of the MPI extension for handling faults, in sync with the current Open MPI master. Innumerable new features have been added both to Open MPI and to ULFM; in this announcement we will focus on the ULFM ones. For information about what is new in Open MPI, please read www.open-mpi.org.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error. This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM chapter of the standard draft document, listed below; a minimal usage sketch follows the list.

  • MPIX_ERR_PROC_FAILED when a process failure prevents the completion of an MPI operation.
  • MPIX_ERR_PROC_FAILED_PENDING when a potential sender matching a non-blocking wildcard source receive has failed.
  • MPIX_ERR_REVOKED when one of the ranks in the application has invoked the MPI_Comm_revoke operation on the communicator.
  • MPIX_Comm_revoke(MPI_Comm comm) Interrupts any communication pending on the communicator at all ranks.
  • MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm* newcomm) creates a new communicator in which the dead processes of comm are removed.
  • MPIX_Comm_agree(MPI_Comm comm, int *flag) performs a consensus (i.e. a fault tolerant allreduce operation) on flag (with the operation bitwise AND).
  • MPIX_Comm_failure_get_acked(MPI_Comm, MPI_Group*) obtains the group of currently acknowledged failed processes.
  • MPIX_Comm_failure_ack(MPI_Comm) acknowledges that the application intends to ignore the effect of currently known failures on wildcard receive completions and agreement return values.
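
The sketch below illustrates how these pieces typically fit together; it is a minimal example (the work performed and the messages printed are made up for illustration, only the MPI and MPIX calls come from the library): errors are returned rather than aborting the job, a failure reported by a collective triggers a revoke and a shrink, and an agreement decides the global outcome of the phase.

    #include <stdio.h>
    #include <mpi.h>
    #include "mpi-ext.h"   /* MPIX_ERR_* codes and MPIX_Comm_* prototypes */

    int main(int argc, char *argv[])
    {
        MPI_Comm work;
        int rank, rc, eclass, flag, one = 1, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_dup(MPI_COMM_WORLD, &work);
        /* Errors must be returned (not aborted on) for the application to react. */
        MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);
        MPI_Comm_rank(work, &rank);

        rc = MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, work);
        MPI_Error_class(rc, &eclass);
        if (MPIX_ERR_PROC_FAILED == eclass || MPIX_ERR_REVOKED == eclass) {
            MPI_Comm shrunk;
            /* Interrupt every rank still working on 'work', then rebuild a
             * communicator that excludes the dead processes. */
            MPIX_Comm_revoke(work);
            MPIX_Comm_shrink(work, &shrunk);
            MPI_Comm_set_errhandler(shrunk, MPI_ERRORS_RETURN);
            MPI_Comm_free(&work);
            work = shrunk;
            /* ... redistribute the lost work and retry the phase on 'work' ... */
        }

        /* Agree on the outcome: with the bitwise AND, flag stays 1 only if every
         * surviving rank contributed a success. */
        flag = (MPI_SUCCESS == rc);
        MPIX_Comm_agree(work, &flag);
        if (0 == rank) printf("phase globally successful: %d\n", flag);

        MPI_Comm_free(&work);
        MPI_Finalize();
        return 0;
    }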

Supported Systems

There are four main MPI network models available in Open MPI: “ob1”, “cm”, “yalla”, and “ucx”. Only “ob1” is adapted to support fault tolerance. “ob1” uses BTL (“Byte Transfer Layer”) components for each supported network. “ob1” supports a variety of networks that can be used in combination with each other. Collective operations (blocking and non-blocking) use an optimized implementation on top of “ob1”.

  • Loopback (send-to-self)
  • TCP
  • OpenFabrics: InfiniBand, iWARP, and RoCE
  • uGNI (Cray Gemini, Aries)
  • Shared memory: Vader (FT supported with CMA; XPMEM and KNEM untested)
  • Tuned, and non-blocking collective communications

A full list of supported, untested and disabled components is provided below.

More Information

More information (tutorials, examples, build instructions for leading Top 500 systems) is also available on the Fault Tolerance Research Hub website: https://fault-tolerance.org

Bibliographic References

If you are looking for, or want to cite, a general reference for ULFM, please use:

Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra: Post-failure recovery of MPI communication capability: Design and rationale. IJHPCA 27(3): 244-254 (2013). Available from: http://journals.sagepub.com/doi/10.1177/1094342013488238.

Building ULFM Open MPI

There are many available configure options (see ./configure --help for a full list); a summary of the more commonly used ones is included in the upstream Open MPI README file. The following paragraph gives a summary of ULFM Open MPI specific options behavior.

Configure options

  • --with-ft=TYPE Specify the type of fault tolerance to enable. Options: mpi (ULFM MPI draft standard). Fault tolerance support is enabled by default (as if --with-ft=mpi were implicitly present on the configure line). You may specify --without-ft to compile an almost stock Open MPI.
  • --with-platform=FILE Load configure options for the build from FILE. When --with-ft=mpi is set, the file contrib/platform/ft_mpi_ulfm is loaded by default. This file disables components that are known to not be able to sustain failures, or are insufficiently tested. You may edit this file and/or force back these options on the command line to enable these components.
  • --enable-mca-no-build=LIST Comma-separated list of framework-component pairs that will not be built. For example, --enable-mca-no-build=btl-portals,oob-ud will disable building the portals BTL and the ud OOB component. When --with-ft=mpi is set, this list is populated with the content of the aforementioned platform file. You may override the default list with this parameter.
  • --with-pmi and --with-slurm Force the building of SLURM scheduler support. Slurm with fault tolerance is tested. Do not use srun, or your application will be killed by the resource manager upon the first failure. Instead, use mpirun within an salloc/sbatch allocation.
  • --with-sge This is untested with fault tolerance.
  • --with-tm Force the building of PBS/Torque scheduler support. PBS is tested with fault tolerance. Use mpirun in a qsub allocation.
  • --disable-mpi-thread-multiple Disable the MPI thread level MPI_THREAD_MULTIPLE (it is enabled by default). Using multiple threads with fault tolerance is lightly tested.
  • --disable-oshmem Disable building the OpenSHMEM implementation (by default, it is enabled). ULFM Fault Tolerance does not apply to OpenSHMEM.

Modified, Untested and Disabled Components

Frameworks and components not listed below are unmodified and support fault tolerance. Listed frameworks may be modified (and work after a failure), untested (they work before a failure, but may malfunction after a failure), or disabled (they would prevent the fault tolerant components from operating properly).

  • pml MPI point-to-point management layer
    • “ob1” modified to handle errors
    • “monitoring”, “v” unmodified, untested
    • “bfo”, “cm”, “crcpw”, “ucx”, “yalla” disabled
  • btl Point-to-point Byte Transfer Layer
    • “openib”, “tcp”, “vader(+cma)” modified to handle errors (removed unconditional abort on error, expect performance similar to upstream)
    • “portals4”, “scif”, “smcuda”, “usnic”, “vader(+knem,+xpmem)” unmodified, untested (may work properly, please report)
  • mtl Matching Transport Layer used for MPI point-to-point messages on some types of networks
    • All “mtl” components are disabled
  • coll MPI collective algorithms
    • “tuned”, “basic”, “nbc” modified to handle errors
    • “cuda”, “fca”, “hcoll”, “portals4” unmodified, untested (expect unspecified post-failure behavior)
  • osc MPI one-sided communications
    • Unmodified, untested (expect unspecified post-failure behavior)
  • io MPI I/O and dependent components
    • fs File system functions for MPI I/O
    • fbtl File byte transfer layer: abstraction for individual read/write operations for OMPIO
    • fcoll Collective read and write operations for MPI I/O
    • sharedfp Shared file pointer operations for MPI I/O
    • All components in these frameworks are unmodified, untested (expect clean post-failure abort)
  • vprotocol Checkpoint/Restart components
    • “pml-v”, “crcp” unmodified, untested
  • wait_sync Multithreaded wait-synchronization object
    • modified to handle errors (added a global interrupt to trigger all wait_sync objects)

Running ULFM Open MPI

Building your application

As ULFM is an extension to the MPI standard, you will need to #include "mpi-ext.h" in C, or use mpi_ext in Fortran to access the supplementary error codes and functions. Compile your application as usual, using the provided mpicc, mpif90, or mpicxx wrappers.
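
For illustration, here is a minimal sketch (the file name and printed messages are invented; only the MPI and MPIX symbols come from the library) of a program that includes "mpi-ext.h" and recovers a wildcard receive interrupted by the failure of a potential sender. It can be compiled with the provided wrapper, e.g. mpicc any_source_recv.c -o any_source_recv.

    /* any_source_recv.c */
    #include <stdio.h>
    #include <mpi.h>
    #include "mpi-ext.h"   /* MPIX_ERR_PROC_FAILED_PENDING, MPIX_Comm_failure_ack, ... */

    int main(int argc, char *argv[])
    {
        int rank, size, payload = 0, rc, eclass, nfailed;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) { MPI_Finalize(); return 0; }

        if (0 == rank) {
            /* Wildcard receive: any peer could be the sender. */
            MPI_Irecv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
            rc = MPI_Wait(&req, &status);
            MPI_Error_class(rc, &eclass);
            if (MPIX_ERR_PROC_FAILED_PENDING == eclass) {
                MPI_Group failed;
                /* A potential sender died but the request is still pending:
                 * acknowledge the known failures so they no longer interrupt the
                 * completion, then wait again for a surviving sender (a real code
                 * would bound this wait, e.g. with MPI_Test). */
                MPIX_Comm_failure_ack(MPI_COMM_WORLD);
                MPIX_Comm_failure_get_acked(MPI_COMM_WORLD, &failed);
                MPI_Group_size(failed, &nfailed);
                printf("wildcard receive interrupted by %d acknowledged failure(s)\n", nfailed);
                MPI_Group_free(&failed);
                rc = MPI_Wait(&req, &status);
            }
            if (MPI_SUCCESS == rc)
                printf("received %d from rank %d\n", payload, status.MPI_SOURCE);
        } else if (1 == rank) {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }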

Running your application

You can launch your application with fault tolerance by simply using the provided mpiexec. Beware that your distribution may already provide a version of MPI; make sure to set your PATH and LD_LIBRARY_PATH properly. Note that fault tolerance is enabled by default in ULFM Open MPI; you can disable all fault tolerance capabilities by launching your application with mpiexec --disable-recovery.

Running under a batch scheduler

ULFM can operate under a job/batch scheduler, and is tested routinely with both PBS and Slurm. One difficulty comes from the fact that many job schedulers will “cleanup” the application as soon as a process fails. In order to avoid this problem, it is preferred that you use mpiexec within an allocation (e.g. salloc, sbatch, qsub) rather than a direct launch (e.g. srun).

Run-time tuning knobs

ULFM comes with a variety of knobs for controlling how it runs. The default parameters are sane and should result in good performance in most cases. You can change the default settings by passing MCA parameters to mpiexec, e.g. --mca mpi_ft_foo <value>.

  • orte_enable_recovery <true|false> (default: true) controls automatic cleanup of apps with failed processes within mpirun. The default differs from upstream Open MPI.
  • mpi_ft_enable <true|false> (default: true) permits turning fault tolerance off without recompiling. When set to false, failure detection is disabled and the interfaces defined by the fault tolerance extensions are substituted with corresponding non-fault-tolerant implementations (e.g. MPI_Allreduce is substituted for MPIX_Comm_agree).
  • mpi_ft_verbose (default: 0) increases the output of fault tolerance activities. A value of 1 will report detected failures.
  • mpi_ft_detector <true|false> (default: false) controls the activation of the OMPI-level failure detector. When this detector is turned off, all failure detection is delegated to ORTE, which may be slow and/or incomplete.
  • mpi_ft_detector_thread <true|false> (default: false) controls the use of a thread to emit and receive the failure detector’s heartbeats. Setting this value to “true” will also enable MPI_THREAD_MULTIPLE support, which has a noticeable effect on latency. You may want to enable this option if you experience false positives (processes incorrectly reported as failed).
  • mpi_ft_detector_period (default: 1e-1 seconds) heartbeat period. The recommended value is 1/3 of the timeout. Values lower than 100us may have a noticeable effect on latency (typically a 3us increase).
  • mpi_ft_detector_timeout (default: 3e-1 seconds) heartbeat timeout (i.e. failure detection speed). The recommended value is 3 times the heartbeat period.

Known Limitations in ULFM-2.0-rc (2e75c73):

  • TOPO, FILE, RMA have little support for fault tolerance mechanisms.
  • There is a tradeoff between failure detection accuracy and performance. The current default is to favor performance. Users that experience detection accuracy issues may enable a more precise mode.
  • The failure detector operates on MPI_COMM_WORLD exclusively. Processes connected with MPI_COMM_CONNECT/ACCEPT or MPI_COMM_SPAWN may occasionally not be detected when they fail.
  • Failures during a non-blocking collective (NBC) may not be recovered properly in some cases.
  • Return of OpenIB credits spent toward a failed process can take several seconds. Until the btl_openib_ib_timeout and btl_openib_ib_retry_count controlled timeout triggers, lack of send credits may cause a temporary stall under certain communication patterns. Impacted users can try to adjust the value of these mca parameters.

Changelog

Release 2.0

Focus has been toward integration with current Open MPI master, performance, and stability.

  • ULFM is now based upon Open MPI master branch (#689f1be9). It will be regularly updated until eventually merged.
  • Fault Tolerance is enabled by default and is controlled with MCA variables.
  • Added support for multithreaded modes (MPI_THREAD_MULTIPLE, etc.)
  • Added support for non-blocking collective operations (NBC).
  • Added support for CMA shared memory transport (Vader).
  • Added support for advanced failure detection at the MPI level. Implements the algorithm described in “Failure detection and propagation in HPC systems.” https://doi.org/10.1109/SC.2016.26.
  • Removed the need for special handling of CID allocation.
  • Non-usable components are automatically removed from the build during configure
  • RMA, FILES, and TOPO components are enabled by default, and usage in a fault tolerant execution warns that they may cause undefined behavior after a failure.
  • Bugfixes:
    • Code and performance cleanup in non-FT builds; --without-ft at configure time gives an almost stock Open MPI.
    • Code and performance cleanup in FT builds with the FT runtime disabled; --mca ft_enable_mpi false thoroughly disables FT runtime activities.
    • Some error cases would return ERR_PENDING instead of ERR_PROC_FAILED in collective operations.
    • Some tests could set ERR_PENDING or ERR_PROC_FAILED instead of ERR_PROC_FAILED_PENDING for ANY_SOURCE receptions.

Release 1.1

Focus has been toward improving stability, feature coverage for intercomms, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

  • Forked from Open MPI 1.5.5 devel branch
  • Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations.
  • Aliased MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING, and MPI_ERR_REVOKED to the corresponding standard-blessed extension names MPIX_ERR_xxx.
  • Support for Intercommunicators:
    • Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators.
    • MPI_COMM_REVOKE tested on intercommunicators.
  • Disabled completely (.ompi_ignore) many untested components.
  • Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
  • Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
  • Bugfixes:
    • SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
    • SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
    • Revoked send operations are now always completed or remotely cancelled, and no longer deadlock.
    • Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
    • Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Release 1.0

Focus has been toward improving performance, both before and after the occurrence of failures. The list of new features includes:

  • Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
  • Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
  • New algorithm to perform agreements, with a truly logarithmic complexity in number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK.
  • New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks.
  • Improved support for our traditional network layer:
    • TCP: fully tested
    • SM: fully tested (with the exception of XPMEM, which remains unsupported)
  • Added support for High Performance networks
    • Open IB: reasonably tested
    • uGNI: reasonably tested
  • The tuned collective module is now enabled by default (reasonably tested); expect a huge performance boost compared to the former basic default setting.
  • Back-ported PBS/ALPS fixes from Open MPI.
  • Back-ported OpenIB bug/performance fixes from Open MPI.
  • Improved the Context ID allocation algorithm to reduce the overhead of Shrink.
  • Miscellaneous bug fixes.

Version Numbers and Binary Compatibility

Starting from ULFM Open MPI version 2.0, ULFM Open MPI is binary compatible with the corresponding Open MPI master branch and compatible releases (see the binary compatibility and version number section in the upstream Open MPI README). That is, applications compiled with a compatible Open MPI can run with the ULFM Open MPI mpirun and MPI libraries. Conversely, as long as the application does not employ one of the MPIX functions, which are exclusively defined in ULFM Open MPI, an application compiled with ULFM Open MPI can be launched with a compatible Open MPI mpirun and run with the non-fault tolerant MPI library.

Contacting the Authors

Found a bug? Got a question? Want to make a suggestion? Want to contribute to ULFM Open MPI? Working on a cool use-case? Please let us know! The best way to report bugs, send comments, or ask questions is to sign up on the user’s mailing list: ulfm+subscribe@googlegroups.com. Because of spam, only subscribers are allowed to post to these lists (ensure that you subscribe with and post from exactly the same e-mail address — joe@example.com is considered different than joe@mycomputer.example.com!). Visit these pages to subscribe to the lists: https://groups.google.com/forum/#!forum/ulfm. When submitting questions and problems, be sure to include as much information as possible. This web page details all the information that we request in order to provide assistance: http://www.open-mpi.org/community/help/

Copyright