Categories

# Logarithmic Agreement Routine

Starting with ULFM-1.0, the implementation features a purely logarithmic agreement, with no single point of failure. A paper presenting this implementation will be presented during SC’15.

We considered a practical agreement algorithm with the following desired properties:

1. the unique decided value is the result of a combination of all values proposed by deciding processes (a major difference with a 1-set agreement),
2. failures consist of permanent crashes in a pseudo-synchronous system (no data corruption, loss of message, or malicious behaviors are considered),
3. the agreement favors the failure-free performance over the failure case, striving to exchange a logarithmic number of messages in the absence of failures.

To satisfy this last requirement, we introduced a practical, intermediate property, called Early Returning: that is the capacity of an early deciding algorithm to return before the stopping condition (early or not) is guaranteed: as soon as a process can determine that the decision value is fixed (except if it fails itself), the process is allowed to return. However, because the process is allowed to return early, later failures may compel that process to participate in additional communications.  Therefore, the decision must remain available after the processes return, in order to serve unexpected message exchanges until the stopping condition can be established. Unlike a regular early stopping algorithm, not all processes decide and stop at the same round, and some processes participate in more message exchanges — depending on the location of failed processes in the logarithmic communication topology.

Through synthetic benchmarks and real applications, we investigated the ERA costs, and highlighted the correspondence between the theoretical and practical logarithmic behavior of the proposed algorithm and the implementation. We have shown that using this algorithm, it is possible to design efficient applications, or domain specific fault mitigation building blocks that preserve the original application performance, while augmenting the capability of the applications to execute on future platforms where large scale may reduce reliability.

Performance of the Early Returning Agreement:

In the figure above, we present the scalability trend of ERA when no failures are disturbing the system.  We consider two different agreement implementations, 1) the known state-of-the-art 2-phase-commit Agreement algorithm presented in [1], called Log2phases, and 2) our best performing version of ERA. We also add, for reference, the performance of an Allreduce operation that in a failure-free context would have had the same outcome as the agreement. On the darter machine using one process per core, thus 16 processes per node, the average branching degree of non-leaf nodes is 2.125. The ERA and the Allreduce operations both exhibit a logarithmic trend when the number of nodes increase, as can be observed by the close fit (asymptotic standard error of $$0.6%$$) of the logarithmic function $$era(x)=6.7 \log_{2.125}(x)$$.

In contrast, the Log2phases algorithm exhibits a linear scaling with  the number of nodes, despite the expected theoretical bound proposed in [1]. Further evaluation show that some linear local computations during the agreement operation end up dominating the execution time, despite the logarithmic number of messages.

In the figure above, we compare the performance of different architecture-aware of ERA employing different tree topologies to perform the reduce/broadcast operations that are the basis of an ERA phase. In the flat binary tree, all ranks are organized in a binary tree, regardless of the hardware locality of ranks collocated on cores of the same node. In the hierarchical methods, one rank represents the node and participates in the inter-node binary tree; on each node, collocated ranks are all children of the representing rank in the bin/star method, or are organized along a node-local binary tree in the bin/bin method. The flat binary topology ERA and the Open MPI Allreduce are both hardware locality agnostic; their performance profiles are extremely similar. In contrast, the Cray Allreduce exhibits a better scalability thanks to accounting for the locality of ranks. Counterintuitively, the bin/star hierarchical topology performs worse than the flat binary tree: the representing rank for a node has 16 local children and the resulting 16 sequential memcpy operations (on the shared-memory device) take longer than the latency to cross the supplementary long-range links. In the bin/bin topology, most of these memory copies are parallelized and henceforth the overall algorithm scales logarithmically. When compared with the fully optimized, non fault tolerant Allreduce, the latency is doubled, which is a logical consequence of the need for the ERA operation to sequentialize a reduce and a broadcast that do not overlap (to ensure the consistent decision criterion in the failure case), while the Allreduce operation is spared that requirement and can overlap multiple interweaved reductions.

Performance of Applications using the Early Returning Agreement:

S3D is a highly parallel method-of-lines solver for partial differential equations and is used to perform first-principles-based direct numerical simulations of turbulent combustion. It employs high order explicit finite difference numerical schemes for spatial and temporal derivatives, and realistic physics for thermodynamics, molecular transport, and chemical kinetics.

Fenix is a framework aimed at enabling online (i.e., without disrupting the job) and transparent recovery from process, node, blade, and cabinet failures for parallel applications in an efficient and scalable manner. Fenix encapsulates mechanisms to transparently capture failures through ULFM return codes, re-spawn new processes on spare nodes when possible, fix failed communicators using ULFM capabilities, restore application state, and return the execution control back to the application.

We use S3D augmented with Fenix to test the effect of the new agreement algorithm on the recovery process when compared to the baseline agreement algorithm. The figures below shows the results of these experiments in terms of total absolute cost of each call to the shrink operation. The MPIX_COMM_SHRINK operation, which uses the agreement algorithm, has been identified in [2] as the most time consuming operation of the recovery process.