Logarithmic Agreement Routine

Starting with ULFM-1.0, the implementation features a purely logarithmic agreement, with no single point of failure. A paper presenting this implementation will be presented during SC’15.

We considered a practical agreement algorithm with the following desired properties:

  1. the unique decided value is the result of a combination of all values proposed by deciding processes (a major difference with a 1-set agreement),
  2. failures consist of permanent crashes in a pseudo-synchronous system (no data corruption, loss of message, or malicious behaviors are considered),
  3. the agreement favors the failure-free performance over the failure case, striving to exchange a logarithmic number of messages in the absence of failures.

To satisfy this last requirement, we introduced a practical, intermediate property, called Early Returning: that is the capacity of an early deciding algorithm to return before the stopping condition (early or not) is guaranteed: as soon as a process can determine that the decision value is fixed (except if it fails itself), the process is allowed to return. However, because the process is allowed to return early, later failures may compel that process to participate in additional communications.  Therefore, the decision must remain available after the processes return, in order to serve unexpected message exchanges until the stopping condition can be established. Unlike a regular early stopping algorithm, not all processes decide and stop at the same round, and some processes participate in more message exchanges — depending on the location of failed processes in the logarithmic communication topology.

Through synthetic benchmarks and real applications, we investigated the ERA costs, and highlighted the correspondence between the theoretical and practical logarithmic behavior of the proposed algorithm and the implementation. We have shown that using this algorithm, it is possible to design efficient applications, or domain specific fault mitigation building blocks that preserve the original application performance, while augmenting the capability of the applications to execute on future platforms where large scale may reduce reliability.

Performance of the Early Returning Agreement:

Download (PDF, 20KB)

In the figure above, we present the scalability trend of ERA when no failures are disturbing the system.  We consider two different agreement implementations, 1) the known state-of-the-art 2-phase-commit Agreement algorithm presented in [1], called Log2phases, and 2) our best performing version of ERA. We also add, for reference, the performance of an Allreduce operation that in a failure-free context would have had the same outcome as the agreement. On the darter machine using one process per core, thus 16 processes per node, the average branching degree of non-leaf nodes is 2.125. The ERA and the Allreduce operations both exhibit a logarithmic trend when the number of nodes increase, as can be observed by the close fit (asymptotic standard error of \(0.6%\)) of the logarithmic function \(era(x)=6.7 \log_{2.125}(x)\).

In contrast, the Log2phases algorithm exhibits a linear scaling with  the number of nodes, despite the expected theoretical bound proposed in [1]. Further evaluation show that some linear local computations during the agreement operation end up dominating the execution time, despite the logarithmic number of messages.

Download (PDF, 20KB)

In the figure above, we compare the performance of different architecture-aware of ERA employing different tree topologies to perform the reduce/broadcast operations that are the basis of an ERA phase. In the flat binary tree, all ranks are organized in a binary tree, regardless of the hardware locality of ranks collocated on cores of the same node. In the hierarchical methods, one rank represents the node and participates in the inter-node binary tree; on each node, collocated ranks are all children of the representing rank in the bin/star method, or are organized along a node-local binary tree in the bin/bin method. The flat binary topology ERA and the Open MPI Allreduce are both hardware locality agnostic; their performance profiles are extremely similar. In contrast, the Cray Allreduce exhibits a better scalability thanks to accounting for the locality of ranks. Counterintuitively, the bin/star hierarchical topology performs worse than the flat binary tree: the representing rank for a node has 16 local children and the resulting 16 sequential memcpy operations (on the shared-memory device) take longer than the latency to cross the supplementary long-range links. In the bin/bin topology, most of these memory copies are parallelized and henceforth the overall algorithm scales logarithmically. When compared with the fully optimized, non fault tolerant Allreduce, the latency is doubled, which is a logical consequence of the need for the ERA operation to sequentialize a reduce and a broadcast that do not overlap (to ensure the consistent decision criterion in the failure case), while the Allreduce operation is spared that requirement and can overlap multiple interweaved reductions.

Performance of Applications using the Early Returning Agreement:

S3D is a highly parallel method-of-lines solver for partial differential equations and is used to perform first-principles-based direct numerical simulations of turbulent combustion. It employs high order explicit finite difference numerical schemes for spatial and temporal derivatives, and realistic physics for thermodynamics, molecular transport, and chemical kinetics.

Fenix is a framework aimed at enabling online (i.e., without disrupting the job) and transparent recovery from process, node, blade, and cabinet failures for parallel applications in an efficient and scalable manner. Fenix encapsulates mechanisms to transparently capture failures through ULFM return codes, re-spawn new processes on spare nodes when possible, fix failed communicators using ULFM capabilities, restore application state, and return the execution control back to the application.

We use S3D augmented with Fenix to test the effect of the new agreement algorithm on the recovery process when compared to the baseline agreement algorithm. The figures below shows the results of these experiments in terms of total absolute cost of each call to the shrink operation. The MPIX_COMM_SHRINK operation, which uses the agreement algorithm, has been identified in [2] as the most time consuming operation of the recovery process.

Download (PDF, 21KB)

In the figure above, we see how the operation scales with an increasing number of failures, from one node (16 cores) up to 64 nodes (1024 cores). We observe the drastic impact of the new ERA agreement compared with the previous Log2phases algorithm, and the absolute time is clearly smaller with the new agreement algorithm, in all cases.

MiniFE is part of the Mantevo mini-applications suite that represents computation and communication patterns of large scale parallel finite element analysis applications that serve a wide spectrum of HPC research. MiniFE has been integrated with a resilient application framework called LFLR (Local Failure Local Recovery), which leverages ULFM to allow on-line application recovery from process loss without the traditional checkpoint/restart (C/R).

Download (PDF, 42KB)

The figure above presents the performance of communicator recovery, executed after a process failure has been detected. The ERA agreement achieves significantly better scaling than the Log2phases algorithm, imposing a low cost on the recovery process.

[1] J. Hursey, T. Naughton, G. Vallee, and R. L. Graham. A Log-scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI. In Proceedings of the 18th European MPI Users’ Group Conference on Recent Advances in the Message Passing Interface, EuroMPI’11, pages 255–263, Berlin, Heidelberg, 2011. Springer-Verlag.

[2] M. Gamell, D. S. Katz, H. Kolla, J. Chen, S. Klasky, and M. Parashar. Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’14, 2014.

ULFM 1.0 Announced

The major 1.0 milestone has been reached for the User Level Failure Mitigation compliant fault tolerant MPI.

We have focused on improving performance, both before and after the occurence of failures. The list of new features includes:

  • Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
  • Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
  • New algorithm to perform agreements, with a truly logarithmic complexity in number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK. Meet us at SC’15 to  learn more about the novel algorithm we designed!
  • New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks. Meet us at EuroMPI’15 to learn more about the Revoke algorithm we designed!
  • Improved support for our traditional network layer:
    • TCP: fully tested
    • SM: fully tested (with the exception of XPMEM, which remains unsupported)
  • Added support for High Performance networks
    • Open IB: reasonably tested
    • uGNI: reasonably tested
  • The tuned collective module is now enabled by default (reasonably tested), expect a huge performance boost compared to the former basic default setting
  • Back-ported PBS/ALPS fixes from Open MPI
  • Back-ported OpenIB bug/performance fixes from Open MPI
  • Improve Context ID allocation algorithm to reduce overheads of Shrink
  • Miscellaneous bug fixes (look at the commit log for the full list).

Fault tolerance support for RMA and IO is still under development.

Get the source and happy hacking,
The ULFM team