Try the Docker packaged ULFM fault tolerant MPI

To support the SC’16 Tutorial, we have designed a self contained Docker image. This packaged docker image contains everything you need to compile, and run the tutorial examples, in a contained sandbox. Docker can be seen as a lightweight virtual machine, running its own copy of an operating system, but without the heavy requirement of a full-blown hypervisor. We use this technology to package a very small Linux distribution containing gcc, mpicc, and mpirun, as needed to compile and run natively your fault tolerant MPI examples on your host Linux, Mac or Windows desktop, without the effort of compiling a production version of ULFM Open MPI on your own.


1. A Docker Image with a precompiled version of ULFM Open MPI 1.1.
2. The tutorial hands-on example.
3. Various tests and benchmarks for resilient operations.
4. The sources for the ULFM Open MPI branch release 1.1.

Using the Docker Image

1. Install Docker
You can install Docker quickly, either by downloading one of the official builds from for MacOS and Windows, or by installing Docker from your Linux or MAcOS package manager (i.e. yum install docker, apt-get docker-io, brew/port install docker-io). Please refer to the Docker installation instructions for your system.
2. In a terminal, verify that the docker installation works by running

3. Unpack the package:

3. Load the pre-compiled ULFM Docker machine into your Docker installation:

4. Source the docker aliases, which will redirect the “make” and “mpirun” command in this terminal’s local shell to execute the provided commands from the Docker machine.

5. Go to the tutorial examples directory. You can now type make to compile the examples using the Docker provided “mpicc”, and you can execute the generated examples in the Docker machine using mpirun -am ft-enable-mpi -np 10 example. Note the special -am ft-enable-mpi parameter; if this parameter is omitted, the non-fault tolerant version of Open MPI is launched and applications containing failures will automatically abort.

Have fun!

ULFM Specification update

A new version of the ULFM specification accounting for remarks and discussions going on at the MPI Forum Meetings in 2016 has been posted under the ULFM Specification item.

This new update has very few semantic changes. It clarifies the failure behavior of MPI_COMM_SPAWN, and corrects the output values of error codes and status objects returned from functions completing in error.

Head to ULFM Specification for more info.

Logarithmic Revoke Routine

Starting with ULFM-1.0, the implementation features a logarithmic revoke operation, with a logarithmically bound per-node communication degree. A paper presenting this implementation will be presented at EuroMPI’15.

The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods.

Purpose of the Revoke Operation

Download (PDF, 82KB)

If the communication pattern of the application is complex, the occurrence of failures has the potential to deeply disturb the application and prevent an effective recovery from being implemented. Consider the example in the above figure: as long as no failure occurs, the processes are communicating in a point-to-point pattern (we decide to call plan A). Process Pk is waiting to receive a message from Pk-1, then sends a message to Pk+1 (when such processes exist). Let’s observe the effect of introducing a failure in plan A, and consider that P1 has failed. As only P2 communicates directly with P1, other processes do not detect this condition, and only P2 is informed of the failure of P1. The situation at P2 now raises a dilemma: P3 waits on P2, a non-failed process, therefore the operation must block until the matching send is posted at P2; however, P2 knows that P1 has failed, and that the application should branch into its recovery procedure plan B; if P2 were to switch abruptly to plan B, it would cease matching the receives P3 posted following plan A. At this point, P2 needs an effective way of interrupting operations that it does not intend to match anymore, otherwise, the application would reach a deadlock: the messages that P3 to Pn are waiting for will never arrive.

The proposed solution to resolve this scenario is that, before switching to plan B, the user code in P2 calls MPI_COMM_REVOKE, a new API which notifies all other processes in the communicator that a condition requiring recovery actions has been reached. Thanks to this flexibility, the cost associated with consistency in error reporting is paid only after an actual failure has happened, and only when necessary to the algorithm, and applications that do not need consistency, or in which the user can prove that the communication pattern remains safe, can enjoy better recovery performance.

When a process of the application calls MPI_COMM_REVOKE (similar operations exist for windows and files), all other alive processes in the communicator eventually receive a notification. The MPI_COMM_REVOKE call has an effect on the entire scope of the communicator, without requiring a collective or matching call at any participant. Instead, the effect of the Revoke operation is observed at other processes during non-matching MPI communication calls: when receiving this notification, any communication on the communicator (ongoing or future) is interrupted and a special error code returned. Then, all surviving processes can safely enter the recovery procedure of the application, knowing that no alive process belonging to that communicator will deadlock as a result.

A Resilient, Asynchronous Broadcast

The revocation notification needs to be propagated to all alive processes in the specified communicator, even when new failures happen during the Revoke propagation. Therefore, it is in essence a reliable broadcast. Among the four defining qualities of a reliable broadcast usually considered in the literature (Termination, Validity, Integrity, Agreement) [1], the non-uniform variants of the properties are sufficient, and the integrity criteria can be relaxed in the context of the Revoke algorithm. These simplified requirements are crucial for decreasing the cost of the Revoke operation, as the size of the messages and the number of message exchanges rounds can be drastically increased when one needs to implement an ordered, uniform reliable broadcast. Given the non-uniform agreement, the no-ordering, and loose integrity properties, in the Revoke reliable broadcast, a process that receives its first Revoke message can perform a single round of emissions to all its neighbors, with a constant message size, and then deliver the Revoke notification immediately, without further verification.

The last important aspect is the topology of the overlay network employed to perform the broadcast operation. In the reliable broadcast algorithm, when a process receives a broadcast message for the first time, it immediately broadcasts that same message to all its neighbors in the overlay graph. The agreement property can be guaranteed only when failures do not disconnect the overlay graph. In early prototype versions of the ULFM implementation, the reliable broadcast procedure employed a fully connected network (which guarantees that disconnected cliques never form). Obviously, this method scales poorly as, with the number or processes, the graph degree is linear, and the number of exchanged messages is quadratic. In practice, at scale, the large graph degree resulted in the application aborting due to resource exhaustion (too many open channels simultaneously, not enough memory for unexpected messages, etc.). Therefore, one needs to consider a more scalable overlay topology with a low graph degree that can yet maintain connectivity when nodes are suppressed.

Binomial Graph Overlay Topology

The Binomial Graph (BMG), introduced in [2], is a topology that features both opposing traits of a small degree, yet a strong resistance to the formation of disconnected cliques when nodes fail. A BMG is an undirected graph G=(V,E), where the vertices V represent a set of processes, and the edges E are a set of links forming an overlay network between these processes. Each vertex is given a unique identifier (i.e. the rank of the process). For each vertex v, there is a link to a set of vertices $$W={v+1, v+2, …, v+2^k~|~2^k < n}$$ Intuitively, a binomial graph can be seen as the union of all the binomial trees rooted at all vertices.

The BMG topology is proven to feature several desirable properties. It is a regular graph topology, in which all nodes have the same degree, even in graphs with unremarkable number of vertices. The degree is logarithmic with the number of nodes, therefore scalable. Meanwhile, it retains a small diameter and a small average distance (in number of hops) between any two nodes (also logarithmic). In addition, a binomial broadcast tree rooted at any node can be naturally extracted from a BMG.

Last, the BMG topology has a high node-connectivity —the minimum number of nodes whose removal can result in disconnecting the network—, which is equal to the degree D. As a consequence the BMG is D-1 node fault tolerant in all cases (an optimal result for a graph of degree D). The probability distribution for the formation of a disconnected graph when D or more failures happen is very favorable (the failures have to strike a particular set of nodes, in a variant of the generalized birthday problem).


Because of its asymmetrical nature, the impact of the Revoke call cannot be measured directly. At the initiator, the call only starts a non-synchronizing wave of token circulation, and measuring the very short duration of the call is not representative of the actual time required for the Revoke call to operate at all target processes. Measuring the time needed for a particular operation to be interrupted gives a better estimate of the propagation time of a Revoke notification. However, the overall impact remains underestimated if one doesn’t account for the fact that even after all processes have successfully delivered a Revoke notification, the reliable broadcast algorithm continues to emit and handle Revoke messages in the background for some time.

Download (PDF, 123KB)

The benchmark we designed measures both the duration and the perturbation generated by the progress of a Revoke operation on the network. The benchmark comprises two communication plans. Plan A is a loop that performs a given collective operation on a communicator commA. At some iteration, an initiator process does not match the collective operation, but, instead, Revokes commA, which effectively ends plan A. Plan B is a similar loop performing the same collective operation in a duplicate communicator commB (thus the Revoke operation on commA does not interrupt operations in plan B). The first (and last) operation in plan A gives an estimate of the Revoke propagation time. Operations in plan B estimate the residual perturbation from echoing messages in the reliable broadcast.

Download (PDF, 25KB)

The above figure presents the scalability of the AllReduce collective communication in the Revoke benchmark on the Darter Cray xc30 supercomputer (tuned collective module, uGNI transport). The first observation is that the performance of post-Revoke collective communications follows the same scalability trend as the pre-Revoke operations, even those impacted by jitter. Aside from the first post-Revoke AllReduce communication, which still exhibit a moderate overhead from jitter, the second post-Revoke AllReduce is only mildly impacted and the third AllReduce exhibit no significant difference from the failure free case, illustrating that the jitter introduced by the reliable broadcast algorithm has a low impact on this communication pattern. When the number of processes increases, the impact of jitter —the difference between the failure-free and the first post-Revoke operation— is almost constant (or slightly decreasing). If this trend were to continue at larger scales, the impact of jitter could become asymptotically negligible.

[1] V. Hadzilacos and S. Toueg. Fault-tolerant broadcasts and related problems. In S. Mullender, editor, Distributed systems (2nd Ed.), chapter 5, pages 97–145. ACM/Addison-Wesley, 1993.

[2] T. Angskun, G. Bosilca, and J. Dongarra. Binomial graph: A scalable and fault-tolerant logical network topology. In I. Stojmenovic, R. Thulasiram, L. Yang, W. Jia, M. Guo, and R. de Mello, editors, Parallel and Distributed Processing and Applications, volume 4742 of Lecture Notes in Computer Science, pages 471–482. Springer Berlin Heidelberg, 2007. Continue reading “Logarithmic Revoke Routine”

ANU presents PDE solver with ULFM at IPDPS

Mohsin Ali and Peter Strazdins presented their work on “Application Level Fault Recovery, Using Fault-Tolerant Open MPI in a PDE Solver”, during the IPDPS PDSEC workshop, last week. See the full slides for more details.

This novel work joins the growing list of applications benefiting from ULFM to feature fault tolerance; more examples are presented in these applications slides.

If you have worked on fault tolerant applications with ULFM, or are thinking about doing so, please contact us.

Preparing for June MPI Forum meeting

In preparation for the June MPI forum meeting, the specification has received some updates. The most prominent changes are:

  • The exposed memory in an RMA window may be completely undefined after a failure has occured.
  • MPI_Comm_agree now operates a binary AND on the flag argument.
  • Examples have been corrected to use error classes, instead of error codes, when relevant.

The latest version is available in the ULFM specification area

ULFM Specification update

A new version of the ULFM specification accounting for remarks and discussions going on at the MPI Forum Meeting in Chicago in December 2013 has been posted under the ULFM Specification item.

This new update adds a new error code to separate process failure errors from non-impacted requests when they remain pending (MPI_ERR_PROC_FAILED_PENDING), and adds new examples.

Head to ULFM Specification for more info.