SC’15 tutorial

Featured

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults.

See you all in Austin, TX !!!

ULFM 1.0 Announced

Featured

The major 1.0 milestone has been reached for the User Level Failure Mitigation compliant fault tolerant MPI.

We have focused on improving performance, both before and after the occurence of failures. The list of new features includes:

• Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
• Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
• New algorithm to perform agreements, with a truly logarithmic complexity in number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK. Meet us at SC’15 to  learn more about the novel algorithm we designed!
• New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks. Meet us at EuroMPI’15 to learn more about the Revoke algorithm we designed!
• Improved support for our traditional network layer:
• TCP: fully tested
• SM: fully tested (with the exception of XPMEM, which remains unsupported)
• Added support for High Performance networks
• Open IB: reasonably tested
• uGNI: reasonably tested
• The tuned collective module is now enabled by default (reasonably tested), expect a huge performance boost compared to the former basic default setting
• Back-ported PBS/ALPS fixes from Open MPI
• Back-ported OpenIB bug/performance fixes from Open MPI
• Improve Context ID allocation algorithm to reduce overheads of Shrink
• Miscellaneous bug fixes (look at the commit log for the full list).

Fault tolerance support for RMA and IO is still under development.

Get the source and happy hacking,
The ULFM team

Logarithmic Revoke Routine

Starting with ULFM-1.0, the implementation features a logarithmic revoke operation, with a logarithmically bound per-node communication degree. A paper presenting this implementation will be presented at EuroMPI’15.

The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods.

Purpose of the Revoke Operation

If the communication pattern of the application is complex, the occurrence of failures has the potential to deeply disturb the application and prevent an effective recovery from being implemented. Consider the example in the above figure: as long as no failure occurs, the processes are communicating in a point-to-point pattern (we decide to call plan A). Process Pk is waiting to receive a message from Pk-1, then sends a message to Pk+1 (when such processes exist). Let’s observe the effect of introducing a failure in plan A, and consider that P1 has failed. As only P2 communicates directly with P1, other processes do not detect this condition, and only P2 is informed of the failure of P1. The situation at P2 now raises a dilemma: P3 waits on P2, a non-failed process, therefore the operation must block until the matching send is posted at P2; however, P2 knows that P1 has failed, and that the application should branch into its recovery procedure plan B; if P2 were to switch abruptly to plan B, it would cease matching the receives P3 posted following plan A. At this point, P2 needs an effective way of interrupting operations that it does not intend to match anymore, otherwise, the application would reach a deadlock: the messages that P3 to Pn are waiting for will never arrive.

The proposed solution to resolve this scenario is that, before switching to plan B, the user code in P2 calls MPI_COMM_REVOKE, a new API which notifies all other processes in the communicator that a condition requiring recovery actions has been reached. Thanks to this flexibility, the cost associated with consistency in error reporting is paid only after an actual failure has happened, and only when necessary to the algorithm, and applications that do not need consistency, or in which the user can prove that the communication pattern remains safe, can enjoy better recovery performance.

When a process of the application calls MPI_COMM_REVOKE (similar operations exist for windows and files), all other alive processes in the communicator eventually receive a notification. The MPI_COMM_REVOKE call has an effect on the entire scope of the communicator, without requiring a collective or matching call at any participant. Instead, the effect of the Revoke operation is observed at other processes during non-matching MPI communication calls: when receiving this notification, any communication on the communicator (ongoing or future) is interrupted and a special error code returned. Then, all surviving processes can safely enter the recovery procedure of the application, knowing that no alive process belonging to that communicator will deadlock as a result.

The revocation notification needs to be propagated to all alive processes in the specified communicator, even when new failures happen during the Revoke propagation. Therefore, it is in essence a reliable broadcast. Among the four defining qualities of a reliable broadcast usually considered in the literature (Termination, Validity, Integrity, Agreement) [1], the non-uniform variants of the properties are sufficient, and the integrity criteria can be relaxed in the context of the Revoke algorithm. These simplified requirements are crucial for decreasing the cost of the Revoke operation, as the size of the messages and the number of message exchanges rounds can be drastically increased when one needs to implement an ordered, uniform reliable broadcast. Given the non-uniform agreement, the no-ordering, and loose integrity properties, in the Revoke reliable broadcast, a process that receives its first Revoke message can perform a single round of emissions to all its neighbors, with a constant message size, and then deliver the Revoke notification immediately, without further verification.

The last important aspect is the topology of the overlay network employed to perform the broadcast operation. In the reliable broadcast algorithm, when a process receives a broadcast message for the first time, it immediately broadcasts that same message to all its neighbors in the overlay graph. The agreement property can be guaranteed only when failures do not disconnect the overlay graph. In early prototype versions of the ULFM implementation, the reliable broadcast procedure employed a fully connected network (which guarantees that disconnected cliques never form). Obviously, this method scales poorly as, with the number or processes, the graph degree is linear, and the number of exchanged messages is quadratic. In practice, at scale, the large graph degree resulted in the application aborting due to resource exhaustion (too many open channels simultaneously, not enough memory for unexpected messages, etc.). Therefore, one needs to consider a more scalable overlay topology with a low graph degree that can yet maintain connectivity when nodes are suppressed.

Binomial Graph Overlay Topology

The Binomial Graph (BMG), introduced in [2], is a topology that features both opposing traits of a small degree, yet a strong resistance to the formation of disconnected cliques when nodes fail. A BMG is an undirected graph G=(V,E), where the vertices V represent a set of processes, and the edges E are a set of links forming an overlay network between these processes. Each vertex is given a unique identifier (i.e. the rank of the process). For each vertex v, there is a link to a set of vertices $$W={v+1, v+2, …, v+2^k~|~2^k < n}$$ Intuitively, a binomial graph can be seen as the union of all the binomial trees rooted at all vertices.

The BMG topology is proven to feature several desirable properties. It is a regular graph topology, in which all nodes have the same degree, even in graphs with unremarkable number of vertices. The degree is logarithmic with the number of nodes, therefore scalable. Meanwhile, it retains a small diameter and a small average distance (in number of hops) between any two nodes (also logarithmic). In addition, a binomial broadcast tree rooted at any node can be naturally extracted from a BMG.

Last, the BMG topology has a high node-connectivity —the minimum number of nodes whose removal can result in disconnecting the network—, which is equal to the degree D. As a consequence the BMG is D-1 node fault tolerant in all cases (an optimal result for a graph of degree D). The probability distribution for the formation of a disconnected graph when D or more failures happen is very favorable (the failures have to strike a particular set of nodes, in a variant of the generalized birthday problem).

Evaluation

Because of its asymmetrical nature, the impact of the Revoke call cannot be measured directly. At the initiator, the call only starts a non-synchronizing wave of token circulation, and measuring the very short duration of the call is not representative of the actual time required for the Revoke call to operate at all target processes. Measuring the time needed for a particular operation to be interrupted gives a better estimate of the propagation time of a Revoke notification. However, the overall impact remains underestimated if one doesn’t account for the fact that even after all processes have successfully delivered a Revoke notification, the reliable broadcast algorithm continues to emit and handle Revoke messages in the background for some time.

The benchmark we designed measures both the duration and the perturbation generated by the progress of a Revoke operation on the network. The benchmark comprises two communication plans. Plan A is a loop that performs a given collective operation on a communicator commA. At some iteration, an initiator process does not match the collective operation, but, instead, Revokes commA, which effectively ends plan A. Plan B is a similar loop performing the same collective operation in a duplicate communicator commB (thus the Revoke operation on commA does not interrupt operations in plan B). The first (and last) operation in plan A gives an estimate of the Revoke propagation time. Operations in plan B estimate the residual perturbation from echoing messages in the reliable broadcast.

The above figure presents the scalability of the AllReduce collective communication in the Revoke benchmark on the Darter Cray xc30 supercomputer (tuned collective module, uGNI transport). The first observation is that the performance of post-Revoke collective communications follows the same scalability trend as the pre-Revoke operations, even those impacted by jitter. Aside from the first post-Revoke AllReduce communication, which still exhibit a moderate overhead from jitter, the second post-Revoke AllReduce is only mildly impacted and the third AllReduce exhibit no significant difference from the failure free case, illustrating that the jitter introduced by the reliable broadcast algorithm has a low impact on this communication pattern. When the number of processes increases, the impact of jitter —the difference between the failure-free and the first post-Revoke operation— is almost constant (or slightly decreasing). If this trend were to continue at larger scales, the impact of jitter could become asymptotically negligible.

[1] V. Hadzilacos and S. Toueg. Fault-tolerant broadcasts and related problems. In S. Mullender, editor, Distributed systems (2nd Ed.), chapter 5, pages 97–145. ACM/Addison-Wesley, 1993.

[2] T. Angskun, G. Bosilca, and J. Dongarra. Binomial graph: A scalable and fault-tolerant logical network topology. In I. Stojmenovic, R. Thulasiram, L. Yang, W. Jia, M. Guo, and R. de Mello, editors, Parallel and Distributed Processing and Applications, volume 4742 of Lecture Notes in Computer Science, pages 471–482. Springer Berlin Heidelberg, 2007. Continue reading

Logarithmic Agreement Routine

Starting with ULFM-1.0, the implementation features a purely logarithmic agreement, with no single point of failure. A paper presenting this implementation will be presented during SC’15.

We considered a practical agreement algorithm with the following desired properties:

1. the unique decided value is the result of a combination of all values proposed by deciding processes (a major difference with a 1-set agreement),
2. failures consist of permanent crashes in a pseudo-synchronous system (no data corruption, loss of message, or malicious behaviors are considered),
3. the agreement favors the failure-free performance over the failure case, striving to exchange a logarithmic number of messages in the absence of failures.

To satisfy this last requirement, we introduced a practical, intermediate property, called Early Returning: that is the capacity of an early deciding algorithm to return before the stopping condition (early or not) is guaranteed: as soon as a process can determine that the decision value is fixed (except if it fails itself), the process is allowed to return. However, because the process is allowed to return early, later failures may compel that process to participate in additional communications.  Therefore, the decision must remain available after the processes return, in order to serve unexpected message exchanges until the stopping condition can be established. Unlike a regular early stopping algorithm, not all processes decide and stop at the same round, and some processes participate in more message exchanges — depending on the location of failed processes in the logarithmic communication topology.

Through synthetic benchmarks and real applications, we investigated the ERA costs, and highlighted the correspondence between the theoretical and practical logarithmic behavior of the proposed algorithm and the implementation. We have shown that using this algorithm, it is possible to design efficient applications, or domain specific fault mitigation building blocks that preserve the original application performance, while augmenting the capability of the applications to execute on future platforms where large scale may reduce reliability.

Performance of the Early Returning Agreement:

In the figure above, we present the scalability trend of ERA when no failures are disturbing the system.  We consider two different agreement implementations, 1) the known state-of-the-art 2-phase-commit Agreement algorithm presented in [1], called Log2phases, and 2) our best performing version of ERA. We also add, for reference, the performance of an Allreduce operation that in a failure-free context would have had the same outcome as the agreement. On the darter machine using one process per core, thus 16 processes per node, the average branching degree of non-leaf nodes is 2.125. The ERA and the Allreduce operations both exhibit a logarithmic trend when the number of nodes increase, as can be observed by the close fit (asymptotic standard error of $$0.6%$$) of the logarithmic function $$era(x)=6.7 \log_{2.125}(x)$$.

In contrast, the Log2phases algorithm exhibits a linear scaling with  the number of nodes, despite the expected theoretical bound proposed in [1]. Further evaluation show that some linear local computations during the agreement operation end up dominating the execution time, despite the logarithmic number of messages.

In the figure above, we compare the performance of different architecture-aware of ERA employing different tree topologies to perform the reduce/broadcast operations that are the basis of an ERA phase. In the flat binary tree, all ranks are organized in a binary tree, regardless of the hardware locality of ranks collocated on cores of the same node. In the hierarchical methods, one rank represents the node and participates in the inter-node binary tree; on each node, collocated ranks are all children of the representing rank in the bin/star method, or are organized along a node-local binary tree in the bin/bin method. The flat binary topology ERA and the Open MPI Allreduce are both hardware locality agnostic; their performance profiles are extremely similar. In contrast, the Cray Allreduce exhibits a better scalability thanks to accounting for the locality of ranks. Counterintuitively, the bin/star hierarchical topology performs worse than the flat binary tree: the representing rank for a node has 16 local children and the resulting 16 sequential memcpy operations (on the shared-memory device) take longer than the latency to cross the supplementary long-range links. In the bin/bin topology, most of these memory copies are parallelized and henceforth the overall algorithm scales logarithmically. When compared with the fully optimized, non fault tolerant Allreduce, the latency is doubled, which is a logical consequence of the need for the ERA operation to sequentialize a reduce and a broadcast that do not overlap (to ensure the consistent decision criterion in the failure case), while the Allreduce operation is spared that requirement and can overlap multiple interweaved reductions.

Performance of Applications using the Early Returning Agreement:

S3D is a highly parallel method-of-lines solver for partial differential equations and is used to perform first-principles-based direct numerical simulations of turbulent combustion. It employs high order explicit finite difference numerical schemes for spatial and temporal derivatives, and realistic physics for thermodynamics, molecular transport, and chemical kinetics.

Fenix is a framework aimed at enabling online (i.e., without disrupting the job) and transparent recovery from process, node, blade, and cabinet failures for parallel applications in an efficient and scalable manner. Fenix encapsulates mechanisms to transparently capture failures through ULFM return codes, re-spawn new processes on spare nodes when possible, fix failed communicators using ULFM capabilities, restore application state, and return the execution control back to the application.

We use S3D augmented with Fenix to test the effect of the new agreement algorithm on the recovery process when compared to the baseline agreement algorithm. The figures below shows the results of these experiments in terms of total absolute cost of each call to the shrink operation. The MPIX_COMM_SHRINK operation, which uses the agreement algorithm, has been identified in [2] as the most time consuming operation of the recovery process.

In the figure above, we see how the operation scales with an increasing number of failures, from one node (16 cores) up to 64 nodes (1024 cores). We observe the drastic impact of the new ERA agreement compared with the previous Log2phases algorithm, and the absolute time is clearly smaller with the new agreement algorithm, in all cases.

MiniFE is part of the Mantevo mini-applications suite that represents computation and communication patterns of large scale parallel finite element analysis applications that serve a wide spectrum of HPC research. MiniFE has been integrated with a resilient application framework called LFLR (Local Failure Local Recovery), which leverages ULFM to allow on-line application recovery from process loss without the traditional checkpoint/restart (C/R).

The figure above presents the performance of communicator recovery, executed after a process failure has been detected. The ERA agreement achieves significantly better scaling than the Log2phases algorithm, imposing a low cost on the recovery process.

[1] J. Hursey, T. Naughton, G. Vallee, and R. L. Graham. A Log-scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI. In Proceedings of the 18th European MPI Users’ Group Conference on Recent Advances in the Message Passing Interface, EuroMPI’11, pages 255–263, Berlin, Heidelberg, 2011. Springer-Verlag.

[2] M. Gamell, D. S. Katz, H. Kolla, J. Chen, S. Klasky, and M. Parashar. Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’14, 2014.

Tutorial @ SC’14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE

Reliability is one of the major concerns when envisioning the future Exascale platforms. The IESP projects an increase in node performance and node concurrency by one or two orders of magnitude, which translates, even under the most optimistic perspectives, in a mechanical decrease of the mean time to interruption (MTTI) of at least one order of magnitude. Because of this tendency, platform providers, software implementors, and high-performance application users who target capability runs on such machines cannot regard the occurrence of interruption due to a failure as a rare dramatic event, but must consider faults inevitable and take them into account by integrating some form of fault-tolerance.

One easy way to get ready is to join us at SC’14 in New Orleans for a tutorial on fault tolerance, a middle-ground between theoretical understanding and practical knowledge. This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high performance systems. The main goal is to provide the attendees with a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithm or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault tolerant techniques with User Level Fault Mitigation (a fault tolerant MPI extension recently proposed to the MPI forum). Relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and ABFT techniques in a hands-on session.

In preparation for the hand-on session one needs to get ready by installing ULFM and setting up the paths to access it. You can either follow the post or the steps below.
2. Untar it in some convenient location.

3. Go inside the newly untarred directory and launch

4. Configure Open MPI. You should change the –prefix in the following command, and make sure that –enable-mpi-ext=ftmpi –with-ft=mpi is specified.

6. Add the directory provided as a –prefix in Step 4 to your PATH and LD_LIBRARY_PATH. As an example for bash one can add the following two lines to \${HOME}/.bashrc

7. You’re almost ready for the hand-on session.
8. The archive with the examples and skeletons is available here.

Let’s rock the faults!

The slides used during this tutorial are available here, here and here.

Uniform Intercomm Creation

A question about uniformly creating an inter-communicator using MPI_Intercomm_create has been posted on the ULFM mailing list. Initially, I though it is an easy corner-case, that can be solved with few barriers and/or agreements. It turns out this issue is more complicated that initially expected, with few twists on the way. Let me detail our adventure toward writing a uniform intercomm creation function.

Before moving further, let’s clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough info to start talking about (page 262 line 6):

This call [MPI_Intercomm_create] creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.

In other words, if you provide two intra-communicators, a leader on each one and a bridge communicator where the leaders can talk together, you will be able to bind the two groups of processes corresponding to each of the intra-communicators into a inter-communicator. Neat! Graphically speaking this should look like

So far so good, but what “uniformly” means? Based on some web resources uniformly is about consistency, in this particular instance about the fact that all participants will return the same answer (aka. the some inter-communicator or the same error code) despite all odds. Here we are looking specifically from the point of view of process failures, more precisely a single failure. Obviously, the trivial approach where one agrees on the resulting intercomm doesn’t work, as due to process failures some processes might not have an intercomm. Clearly, we need a different, better approach.

Looks quite simple and straightforward. Does it do reach the expected consensus at the end? Unfortunately not, it doesn’t work in all cases. Imagine that all peers in one side succeeded to create the intercomm while everyone in the remote communicator failed. Then on one side the interexists is TRUE while on the other side it is FALSE. Thus the local barrier (line 3) succeed on the side with interexists equal to TRUE and the intercomm is not revoked. As the intercomm exists, these processes will then go in the second MPI_Barrier (line 10), on the “half-created” intercomm, and will deadlock as there will never be a remote process to participate to the barrier.

What we are missing here is an agreement between the two sides (A and B) that things went well for everyone. So let’s try to use agreement to reach this goal. In the remaining of this post I will use the function AGREE, which is a point-to-point agreement between two processes (the two leaders). We need this particular function, as we cannot call MPI_Comm_agree on the bridge communicator, simply because the two leaders are not the only participants in the bridge communicator. Fortunately, implementing an agreement between two processes is almost a trivial task, so I will consider the AGREE function as existing.

If there is no failure the code works, but it’s an overkill. Now, if we have one failure during the execution there are several cases.
1. The failure happens before the MPI_Intercomm_create. Some processes will have an intercomm, and some will not. Thus the local agreement will make everyone on the same group (A and B) agree about the local existence of the intercomm. The AGREE between the leaders will then propagate this knowledge to the remote group leader. At this point the two leaders have the correct answer, and then the MPI_Comm_agree at line 8 will finally distribute this answer on the two groups. This case works!
2. The failure happen after the MPI_Intercomm_create. One can think about this case as a false positive (because the intercomm exists on all alive processes), but the real question is if we are able to detect it and reach a consensus. The MPI_Comm_agree at line 3 will make the flag TRUE on both groups. Note that on one side the MPI_Comm_agree will return MPI_ERR_PROC_FAILED as required by the MPI standard, but we do not take in account the return code here.

Here is the reasoning leading to this form of the algorithm.
1. The MPI_Comm_agree on line 3 is a little tricky. We use the returned value of the flag (bitwise AND operation between all living processes), but we willingly choose to ignore the return value. What really matters is that all alive participants have the resulting intercomm, and this will be reflected in the value of flag after the call. All other cases are trapped either by partially existing intercomm, or by the REVOKE call. Moreover, it is extremely important that the resulting value of flag is computed through an agreement and not through a traditional collective communication (which lack the consistency factor).
2. For a similar reason the MPI_Comm_agree on line 9 cannot be replaced by a MPI_Broadcast, even as its meaning is to broadcast the information collected at the local root into the local group of processes. Without the use of MPI_Comm_agree here one cannot distinguish between every process still alive has an intercomm with known faults or only some participants have an intercomm (with faults). Thus, as the intercomm can be revoked only if all alive processes know about it, we have to enforce this consistent view by the use of the agreement.

While this inter-communicator creation looks like an extremely complicated and expensive matter, one should keep in mind that such communicator creation are not supposed to be in the critical path, so they might not be as prohibitive as one might think. If Everything Was Easy Nothing Would Be Worth It!

ANU presents PDE solver with ULFM at IPDPS

Mohsin Ali and Peter Strazdins presented their work on “Application Level Fault Recovery, Using Fault-Tolerant Open MPI in a PDE Solver”, during the IPDPS PDSEC workshop, last week. See the full slides for more details.

This novel work joins the growing list of applications benefiting from ULFM to feature fault tolerance; more examples are presented in these applications slides.