ULFM User Level Failure Mitigation

Spurious Errors: lack of MPI Progress and Failure Detection

A very common mishap while developing parallel applications is to assume that an application running correctly at small scale automatically translates into successful large-scale runs. Such optimistic views are largely unproven in general, but tools exist to help with the validation process. However, adding resilience to a parallel application tends to increase the likelihood of consistency errors, and unfortunately no tools to help through this process currently exist. In some cases, a few common-sense practices can save hours of debugging and improve the quality of the parallel application.

As you work on your MPI fault-tolerant application, you discover that it runs fine at small scale with small data sets, but when you increase the number of processes or the computational load on the participating processes, spurious faults seem to be 'injected' for no good reason. It might be easy to blame the underlying libraries, but before we go there, it is possible you are observing an interoperability issue between the different layers of the resilient software stack. More precisely, you may be observing the effect of the lack of MPI progress on the failure detector within the MPI library.

MPI Progress (and lack thereof)

The MPI Standard does not mandate that asynchronous progress shall always happen. When you post a nonblocking communication, such as MPI_Irecv or MPI_Isend, you may be surprised to discover that, depending on the MPI library configuration and/or the network transport in use, no progress at all may happen before you enter the corresponding completion function (MPI_Wait) or some other MPI function. This is especially true for non-blocking collectives, for which asynchronous progress is a rarity, even with network hardware that would otherwise be capable of asynchronous progress.

The root cause is an optimization that helps obtain good results on raw latency/injection rate (e.g., in performance benchmarks), but is not an optimal choice for applications with long swaths of computational code that do not perform any MPI calls (and hence do not trigger the progress engine of MPI). Achieving asynchronous progress for all operations often requires a progress thread, internal to the MPI implementation, which entails that most MPI calls have to be thread-safe, even if the user initializes with MPI_THREAD_SINGLE. It is also more difficult to implement. Hence, most MPI implementations default to a 'user-call-progressed' mode, that is, message transmission and reception management are both progressed when the user enters one of the 'progressing' MPI routines (e.g., MPI_Wait, MPI_Test, MPI_Recv, etc.). Upon entering such a routine, any ongoing MPI operation may progress, including operations that are not referenced by the call.

For the same reasons that asynchronous progress is not the default behavior in Open MPI (latency performance optimization), the failure detector in ULFM is also call-progress based.

An example

Let’s look at the following code example to illustrate the progress issue.


#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void compute(int how_long);

int main(int argc, char* argv[]) {
    /* Send some large data to make the recv long enough to see something interesting */
    int count = 4*1024*1024;
    int* send_buff = malloc(count * sizeof(*send_buff));
    int* recv_buff = malloc(count * sizeof(*recv_buff));

    /* Every process sends a token to the right on a ring (size tokens sent per
     * iteration, one per process per iteration) */
    int left = MPI_PROC_NULL, right = MPI_PROC_NULL;
    int rank = MPI_PROC_NULL, size = 0;
    int tag = 42, iterations = 10, how_long = 5;
    MPI_Request req = MPI_REQUEST_NULL;
    double post_date, wait_date, complete_date;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    left = (rank+size-1)%size; /* left ring neighbor in circular space */
    right = (rank+size+1)%size; /* right ring neighbor in circular space */

/**** Done with initialization, main loop now ************************/
    printf("Entering test; computing silently for %d seconds\n", how_long);
    for(int i = 0; i < iterations; i++) {
        post_date = MPI_Wtime();
        MPI_Irecv(recv_buff, count, MPI_INT, left, tag, MPI_COMM_WORLD, &req);
        MPI_Send (send_buff, count, MPI_INT, right, tag, MPI_COMM_WORLD);
        compute(how_long); /* long computation without any MPI call */
        wait_date = MPI_Wtime();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        complete_date = MPI_Wtime();
        printf("Worked %g seconds before entering MPI_Wait and spent %g seconds waiting\n",
               wait_date - post_date,
               complete_date - wait_date);
    }

    free(send_buff);
    free(recv_buff);
    MPI_Finalize();
    return 0;
}

static void compute(int how_long) {
    /* no real computation, just waiting doing nothing */
    sleep(how_long);
}

Without Fault Tolerance

The program will report how much of the time was overlapped between the compute operation and the reception of the large message. Note that this is the easy case: a contiguous receive. Hence, most high performance networks will achieve overlap. If we run on one of the networks that does not perform asynchronous progress, we may still observe that the majority (or the totality) of the communication time is visible in the duration of the MPI_Wait function. In essence, the lack of progress can cause disappointing performance with respect to communication/computation overlap.

With Fault Tolerance

The program will produce the following output:

mpirun -mca mpi_ft_detector_timeout 1 -np 10 sleeptest
Entering test; computing silently for 5 seconds
[saturn:62859] *** An error occurred in MPI_Send
[saturn:62859] *** reported by process [3748724737,1]
[saturn:62859] *** on communicator MPI_COMM_WORLD
[saturn:62859] *** MPI_ERR_PROC_FAILED: Process Failure
[saturn:62859] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[saturn:62859] *** and potentially your MPI job)

Our test program has aborted, because the MPI library has reported that a process has failed. This is manifestly a spurious error that is raised despite the fact that no process has failed. The root cause lies in the fact that, since the application stopped performing any MPI calls for longer than the detection timeout, there was no opportunity for progressing the MPI library, either to progress asynchronous communications (the aforementioned performance issue), or to progress the internal failure detector. Due to lack of progress, the dates at which heartbeat messages are sent and their reception managed can drift from the expectations set in the failure detector, and that drift can grow to make a process appear unresponsive for long enough to be reported as dead (hence resulting in the occurrence of a spurious error).

Tuning the Detector for your Use Case

Most communication-intensive applications will do just fine with the default settings for the failure detector. These settings favor latency and injection rate metrics by setting sane defaults that avoid triggering spurious faults in most communicating applications, while minimizing the performance impact on the most common communication patterns. It is however important to note that when spurious errors do get reported, fixing them may be a simple matter of tuning the runtime parameters, or your application code.

Solution 1: tune the detector

The detector in ULFM is highly configurable. You can change both the timeout, and the use of an internal progress thread within the MPI library that will ensure that the detector remains active when the application does not perform MPI progress. You can set the following mpirun parameters to control the detector:

  • -mca mpi_ft_detector_timeout number: number of seconds before the detector reports an error from an unresponsive process (floating point value). If you experience spurious errors, this can be a first step at weeding them out. The advantage of this approach is that it will leave micro-benchmarks performance unchanged; however, the timeout for detection of actual failures will be increased. Given that failures are somewhat rare, this can often be a beneficial trade-off.
  • -mca mpi_ft_detector_thread [true/false]: defaults to false (to preserve micro-benchmark performance). When set to true, the detector executes in an MPI-library-internal thread, thus avoiding the dependence on user progress. The performance trade-off comes from the fact that this flag acts as if MPI_Init had been called with MPI_THREAD_MULTIPLE, which has a mild impact on latency/injection rate. This is often a good trade-off for real applications.
  • -mca mpi_ft_detector [true/false]: can be used to turn off the detector altogether. This approach can work with some networks that provide in-band error reporting (e.g., TCP), but the failure detection timeout is bounded by external parameters (TCP timeouts can reach into hours under certain circumstances), and some networks do not provide in-band detection, which can result in deadlocks in failure scenarios.
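To make these knobs concrete, here are example invocations combining them with the test program from above; the timeout value and executable name are illustrative choices, not recommendations from the original post.

```shell
# Raise the detection timeout to 30 seconds: micro-benchmark performance is
# unchanged, but real failures take longer to be reported.
mpirun -mca mpi_ft_detector_timeout 30 -np 10 ./sleeptest

# Run the detector in an internal thread, decoupling it from user progress
# (acts as if MPI_THREAD_MULTIPLE; mild latency/injection-rate impact).
mpirun -mca mpi_ft_detector_thread true -np 10 ./sleeptest

# Disable the detector entirely; only safe on networks with in-band error
# reporting (e.g., TCP), and detection may then take a very long time.
mpirun -mca mpi_ft_detector false -np 10 ./sleeptest
```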

Solution 2: progress MPI

A different approach is to dedicate some resources to progressing MPI operations. A traditional way for achieving continuous progress from the user-code is to initialize MPI with MPI_THREAD_MULTIPLE, and dedicate a thread to MPI progress by parking the thread in an MPI_Recv for the entire duration of the program. This receive should never be matched by a send, except at the very end of the program, in order to release the progress thread and let it terminate.

This technique lets the user dedicate one of her own threads to ensure that the MPI progress engine is active at all times. It avoids spurious errors and often also improves performance in applications that are sensitive to communication/computation overlap (especially when using complex communication routines, like non-contiguous datatypes and non-blocking collective operations, which cannot be easily delegated to the network hardware). Note however that the raw latency/bandwidth in micro-benchmarks can be impacted, and one has to dedicate a CPU core to this background activity.
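The parked-receive technique described above can be sketched as follows. This is a minimal illustration, not code from the original post: the shutdown tag, the use of a duplicate of MPI_COMM_SELF, and the function names are our own choices.

```c
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define SHUTDOWN_TAG 4242 /* assumed: any tag unused by the application */

/* The progress thread parks in MPI_Recv; while blocked here it drives the
 * MPI progress engine (and therefore the ULFM failure detector). */
static void* progress_fn(void* arg) {
    MPI_Comm comm = *(MPI_Comm*)arg;
    int dummy;
    /* This receive matches only the shutdown token sent at the very end. */
    MPI_Recv(&dummy, 1, MPI_INT, 0, SHUTDOWN_TAG, comm, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char* argv[]) {
    int provided, token = 0;
    MPI_Comm self_comm;
    pthread_t tid;

    /* Two threads call MPI concurrently, so MPI_THREAD_MULTIPLE is required. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if( provided < MPI_THREAD_MULTIPLE ) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* A duplicate of MPI_COMM_SELF so the parked receive cannot intercept
     * application traffic. */
    MPI_Comm_dup(MPI_COMM_SELF, &self_comm);
    pthread_create(&tid, NULL, progress_fn, &self_comm);

    /* ... application computation and communication ... */

    /* Release the progress thread at the very end of the program. */
    MPI_Send(&token, 1, MPI_INT, 0, SHUTDOWN_TAG, self_comm);
    pthread_join(tid, NULL);
    MPI_Comm_free(&self_comm);
    MPI_Finalize();
    return 0;
}
```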


The lack of calls progressing the MPI library is a known adverse situation in HPC applications. In the fault-free world, this can translate into the application significantly under-performing when users expect communication/computation overlap. This issue can compound during fault-tolerant operation by causing drifts in the failure detector, resulting in spurious error reporting (i.e., the notification of failures about processes that have not failed). Fortunately, the ULFM implementation provides multiple parameters that permit circumventing this issue.

Running on Edison

Edison is a Cray XC30, with a peak performance of 2.57 petaflops/sec, 133,824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk. Hosted at NERSC, it offers access to many researchers and practitioners in the US.

Installing the latest versions of Open MPI (development, unstable, or stable) on this machine can easily be done using the provided platform file. The same cannot, unfortunately, be said about the current ULFM version. This version was based on an older unstable version of Open MPI (1.7) and did not backport all the new exciting features from the development branch of Open MPI. As a result, getting it to work on Edison is a little challenging, but we all like a little challenge.

Preparing for installation

This step is unfortunately required, as I couldn't find any other way to update the m4 file of our ancient ULFM installation without dedicating too much time to such a hopeless task (especially as the new ULFM 2.0 is almost at stable status). The issue is triggered by a misconfiguration of autoconf 2.69 on Edison, which breaks the entire autotools chain. It can easily be worked around by downloading the Mercurial version of ULFM on another computer, running a quick configure with no arguments, followed by "make dist". This will generate 2 archives (openmpi-1.7.1ULFM.tar.bz2 and openmpi-1.7.1ULFM.tar.gz). Use the one corresponding to your preferred compression algorithm, and move it to Edison.

Compiling from source

With the archive moved to Edison, you can now undergo the last step: configuring and compiling ULFM there. Let's assume we want to build an optimized version, integrated with Edison's resource management software (Slurm), using uGNI, without debugging information. One of the interesting features ULFM inherited from Open MPI is the capability of using platform files. Here is the platform file for Edison; it will eventually make its way into the official ULFM release, but meanwhile you can copy and paste it directly from here.


With this platform file saved as platform.edison, you can now proceed to configure ULFM using "./configure --with-platform=platform.edison ...". There are many other possible configuration parameters, and you are strongly encouraged to read the output of "configure --help" to gain more insight into the flexible configuration system. Meanwhile, here is what I usually add in addition to the platform file: "--prefix=... --enable-mpirun-prefix-by-default" (please replace the dots with your preferred installation directory). Keep in mind that the installation directory must be in your $PATH, to have direct access to the wrapper compilers and to mpiexec.

Preparing for execution

In general we suggest using a specific AMCA file (the -am parameter) to provide the parameters for your ULFM runs (so they do not collide with normal Open MPI runs). As long as you use mpiexec to start your application, this will work. If instead you want to rely on srun to do so, you will have to dump the contents of your AMCA file directly into your main ${HOME}/.openmpi/mca-params.conf file (and pay attention when you execute normal Open MPI runs).
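As an illustration, a ULFM-specific AMCA file could be prepared and used as below; the file name and the parameter values are our own hypothetical choices.

```shell
# Collect ULFM-specific parameters in a dedicated AMCA file,
# so normal Open MPI runs are unaffected:
cat > ulfm-params.conf <<'EOF'
mpi_ft_detector_timeout = 30
mpi_ft_detector_thread = true
EOF

# Pass the file with -am when launching through mpiexec:
mpiexec -am ulfm-params.conf -np 4 ./my_ft_app
```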

Running ULFM applications in an interactive job

Congratulations, the hardest steps are now done: you should have a fully compiled, ready-to-run version of ULFM, with its binaries in your $PATH. Allocating an interactive job on Edison is well described in the online documentation. There is however a trick: you need to add an option to prevent Slurm from destroying your job when one of the processes disappears. This option is --no-kill.

Thus, allocating an interactive job with 4 nodes for a short duration can be achieved with "salloc -p debug -N 4 --no-kill".
Running an application then becomes easy, as indicated in the ULFM setup steps.

Survival readiness

Your code now has a sane execution environment that is able to survive faults, and can help your application achieve the same goal.

Uniform Intercomm Creation

A question about uniformly creating an inter-communicator using MPI_Intercomm_create was posted on the ULFM mailing list. Initially, I thought it was an easy corner case that could be solved with a few barriers and/or agreements. It turns out this issue is more complicated than initially expected, with a few twists along the way. Let me detail our adventure toward writing a uniform intercomm creation function.

Before moving further, let's clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough information to start talking about it (page 262 line 6):

This call [MPI_Intercomm_create] creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.

In other words, if you provide two intra-communicators, a leader on each one, and a bridge communicator where the leaders can talk together, you will be able to bind the two groups of processes corresponding to each of the intra-communicators into an inter-communicator. Neat!

So far so good, but what does "uniformly" mean? Based on some web resources, uniformly is about consistency; in this particular instance, about the fact that all participants will return the same answer (i.e., the same inter-communicator or the same error code) despite all odds. Here we look specifically at process failures, more precisely at a single failure. Obviously, the trivial approach where one agrees on the resulting intercomm doesn't work, as due to process failures some processes might not have an intercomm. Clearly, we need a different, better approach.

While talking about this, Aurelien proposed the following solution, entirely based on MPI_Barrier.

rc = MPI_Intercomm_create(local, ..., peer_comm, ..., &inter);
interexists = (MPI_SUCCESS == rc);
rc = MPI_Barrier(local);
if( interexists && (MPI_SUCCESS != rc) ) {
    MPI_Comm_revoke(inter);
    interexists = 0;
}

if( interexists ) {
    rc = MPI_Barrier(inter);
    if( MPI_ERR_REVOKED != rc )
        return MPI_SUCCESS;
}
return MPI_ERR_PROC_FAILED;
Looks quite simple and straightforward. Does it reach the expected consensus at the end? Unfortunately not; it doesn't work in all cases. Imagine that all peers on one side succeeded in creating the intercomm while everyone in the remote communicator failed. Then on one side interexists is TRUE while on the other side it is FALSE. Thus the local barrier (line 3) succeeds on the side where interexists is TRUE, and the intercomm is not revoked. As the intercomm exists, these processes will then go into the second MPI_Barrier (line 10), on the "half-created" intercomm, and will deadlock, as there will never be a remote process to participate in the barrier.

What we are missing here is an agreement between the two sides (A and B) that things went well for everyone. So let's try to use agreement to reach this goal. In the remainder of this post I will use the function AGREE, which is a point-to-point agreement between two processes (the two leaders). We need this particular function because we cannot call MPI_Comm_agree on the bridge communicator, simply because the two leaders are not the only participants in the bridge communicator. Fortunately, implementing an agreement between two processes is an almost trivial task, so I will consider the AGREE function as existing.
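For concreteness, one possible shape for such a two-leader AGREE is sketched below. This is our own illustration (AGREE is not an MPI routine): the two leaders exchange their local flag and AND the two values, and a failure of the exchange itself makes the agreement fail. A fully robust version would add an acknowledgment round to handle a leader failing mid-protocol.

```c
#include <mpi.h>

/* Point-to-point agreement between the two leaders on 'bridge'.
 * 'remote_leader' is the rank of the other leader in the bridge
 * communicator, and 'tag' is assumed unused by other traffic. */
static int AGREE(MPI_Comm bridge, int remote_leader, int* flag) {
    int remote_flag = 0, tag = 4242;
    /* Exchange the local verdicts; MPI_Sendrecv returns an error if the
     * remote leader is dead, which we report as a failed agreement. */
    int rc = MPI_Sendrecv(flag, 1, MPI_INT, remote_leader, tag,
                          &remote_flag, 1, MPI_INT, remote_leader, tag,
                          bridge, MPI_STATUS_IGNORE);
    /* The agreed value is the AND of the two local verdicts. */
    *flag = *flag && remote_flag;
    return rc;
}
```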

rc = MPI_Intercomm_create(local, ..., peer_comm, ..., &intercomm);
flag = (MPI_SUCCESS == rc);
MPI_Comm_agree(local, &flag);
if( i_am_leader ) {
    rc = AGREE(bridge, &flag);
    flag &= (MPI_SUCCESS == rc);
}
rc = MPI_Comm_agree(local, &flag);
flag &= (MPI_SUCCESS == rc);

If there is no failure the code works, but it is overkill. Now, if we have one failure during the execution, there are several cases.
1. The failure happens before the MPI_Intercomm_create. Some processes will have an intercomm, and some will not. Thus the local agreement will make everyone in the same group (A or B) agree about the local existence of the intercomm. The AGREE between the leaders will then propagate this knowledge to the remote group leader. At this point the two leaders have the correct answer, and the MPI_Comm_agree at line 8 will finally distribute this answer over the two groups. This case works!
2. The failure happens after the MPI_Intercomm_create. One can think of this case as a false positive (because the intercomm exists on all alive processes), but the real question is whether we are able to detect it and reach a consensus. The MPI_Comm_agree at line 3 will make the flag TRUE in both groups. Note that on one side the MPI_Comm_agree will return MPI_ERR_PROC_FAILED, as required by the MPI standard, but we do not take the return code into account here. However, when the agreement fails, the "half-created" intercomm must be revoked so that no process can block on it, which leads to the final version of the algorithm.

rc = MPI_Intercomm_create(local, ..., peer_comm, ..., &intercomm);
flag = (MPI_SUCCESS == rc);
MPI_Comm_agree(local, &flag);
if( i_am_leader ) {
    rc = AGREE(bridge, &flag);
    flag &= (MPI_SUCCESS == rc);
    if( !flag ) MPI_Comm_revoke(intercomm);
}
rc = MPI_Comm_agree(local, &flag);
flag &= (MPI_SUCCESS == rc);
if( !flag ) MPI_Comm_revoke(intercomm);

Here is the reasoning leading to this form of the algorithm.
1. The MPI_Comm_agree on line 3 is a little tricky. We use the returned value of the flag (a bitwise AND operation between all living processes), but we willingly choose to ignore the return value. What really matters is that all alive participants have the resulting intercomm, and this will be reflected in the value of flag after the call. All other cases are trapped either by a partially existing intercomm, or by the REVOKE call. Moreover, it is extremely important that the resulting value of flag is computed through an agreement and not through a traditional collective communication (which lacks the consistency factor).
2. For a similar reason, the MPI_Comm_agree on line 9 cannot be replaced by an MPI_Bcast, even if its meaning is to broadcast the information collected at the local root into the local group of processes. Without the use of MPI_Comm_agree here, one cannot distinguish between "every process still alive has an intercomm with known faults" and "only some participants have an intercomm (with faults)". Thus, as the intercomm can be revoked only if all alive processes know about it, we have to enforce this consistent view through the agreement.
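Putting the pieces together, the whole protocol can be wrapped in a single routine. This sketch is ours, not from the original post: the function name is hypothetical, and it assumes an AGREE helper with the signature used in the snippets above is available.

```c
#include <mpi.h>

/* Assumed available: point-to-point agreement between the two leaders. */
extern int AGREE(MPI_Comm bridge, int* flag);

/* Uniform intercomm creation: every alive process returns the same
 * outcome, either a usable intercomm or MPI_ERR_PROC_FAILED. */
int uniform_intercomm_create(MPI_Comm local, int local_leader,
                             MPI_Comm bridge, int remote_leader,
                             int tag, int i_am_leader,
                             MPI_Comm* intercomm)
{
    int rc, flag;

    *intercomm = MPI_COMM_NULL;
    rc = MPI_Intercomm_create(local, local_leader, bridge, remote_leader,
                              tag, intercomm);
    flag = (MPI_SUCCESS == rc);
    /* Consistent local verdict (return code willingly ignored, see above). */
    MPI_Comm_agree(local, &flag);
    if( i_am_leader ) {
        /* Leaders exchange the verdicts of the two groups. */
        rc = AGREE(bridge, &flag);
        flag &= (MPI_SUCCESS == rc);
        if( !flag && MPI_COMM_NULL != *intercomm )
            MPI_Comm_revoke(*intercomm);
    }
    /* Distribute the leaders' verdict consistently within each group. */
    rc = MPI_Comm_agree(local, &flag);
    flag &= (MPI_SUCCESS == rc);
    if( !flag ) {
        if( MPI_COMM_NULL != *intercomm ) {
            MPI_Comm_revoke(*intercomm);
            MPI_Comm_free(intercomm);
        }
        *intercomm = MPI_COMM_NULL;
        return MPI_ERR_PROC_FAILED;
    }
    return MPI_SUCCESS;
}
```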

While this inter-communicator creation looks extremely complicated and expensive, one should keep in mind that such communicator creations are not supposed to be on the critical path, so they might not be as prohibitive as one might think. If Everything Was Easy Nothing Would Be Worth It!

Slides with ULFM examples

Some new slides with ULFM examples are now available.