[crayon-5f2bb58619bbc293120535inline="true"]<![CDATA[Reliability isone of the major concerns when envisioning the future Exascale platforms.The<atitle="IESP"href="http://www.exascale.org"target="_blank"rel="noopener noreferrer">IESP</a>projects an increase innode performance andnode concurrency by one ortwo orders of magnitude,which translates,even under the most optimistic perspectives,inamechanical decrease of the mean time tointerruption(MTTI)of at least one order of magnitude.Because of thistendency,platform providers,software implementors,andhigh-performance application users who target capability runs on such machines cannot regard the occurrence of interruption due toafailure asarare dramatic event,but must consider faults inevitable andtake them into account by integrating some form of fault-tolerance.
One easy way to get ready is to join us at SC’14 in New Orleans for a tutorial on fault tolerance, a middle-ground between theoretical understanding and practical knowledge. This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high performance systems. The main goal is to provide the attendees with a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithm or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault tolerant techniques with User Level Fault Mitigation (a fault tolerant MPI extension recently proposed to the MPI forum). Relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and ABFT techniques in a hands-on session.
In preparation for the hand-on session one needs to get ready by installing ULFM and setting up the paths to access it. You can either follow the post or the steps below.
1. Download the version prepared for this tutorial.
2. Untar it in some convenient location.
tar jxvf ulfm-4419d3f7cee3.tbz
Go inside the newly untarred directory and launch
Configure Open MPI. You should change the –prefix in the following command, and make sure that –enable-mpi-ext=ftmpi –with-ft=mpi is specified.
A question about uniformly creating an inter-communicator using MPI_Intercomm_create has been posted on the ULFM mailing list. Initially, I though it is an easy corner-case, that can be solved with few barriers and/or agreements. It turns out this issue is more complicated that initially expected, with few twists on the way. Let me detail our adventure toward writing a uniform intercomm creation function.
Before moving further, let’s clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough info to start talking about (page 262 line 6):
This call [MPI_Intercomm_create] creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.
In other words, if you provide two intra-communicators, a leader on each one and a bridge communicator where the leaders can talk together, you will be able to bind the two groups of processes corresponding to each of the intra-communicators into a inter-communicator. Neat! Graphically speaking this should look like
So far so good, but what “uniformly” means? Based on some web resources uniformly is about consistency, in this particular instance about the fact that all participants will return the same answer (aka. the some inter-communicator or the same error code) despite all odds. Here we are looking specifically from the point of view of process failures, more precisely a single failure. Obviously, the trivial approach where one agrees on the resulting intercomm doesn’t work, as due to process failures some processes might not have an intercomm. Clearly, we need a different, better approach.
While talking about this, Aurelien proposed the following solution, entirely based on MPI_Barrier.
Looks quite simple and straightforward. Does it do reach the expected consensus at the end? Unfortunately not, it doesn’t work in all cases. Imagine that all peers in one side succeeded to create the intercomm while everyone in the remote communicator failed. Then on one side the interexists is TRUE while on the other side it is FALSE. Thus the local barrier (line 3) succeed on the side with interexists equal to TRUE and the intercomm is not revoked. As the intercomm exists, these processes will then go in the second MPI_Barrier (line 10), on the “half-created” intercomm, and will deadlock as there will never be a remote process to participate to the barrier.
What we are missing here is an agreement between the two sides (A and B) that things went well for everyone. So let’s try to use agreement to reach this goal. In the remaining of this post I will use the function AGREE, which is a point-to-point agreement between two processes (the two leaders). We need this particular function, as we cannot call MPI_Comm_agree on the bridge communicator, simply because the two leaders are not the only participants in the bridge communicator. Fortunately, implementing an agreement between two processes is almost a trivial task, so I will consider the AGREE function as existing.
If there is no failure the code works, but it’s an overkill. Now, if we have one failure during the execution there are several cases.
1. The failure happens before the MPI_Intercomm_create. Some processes will have an intercomm, and some will not. Thus the local agreement will make everyone on the same group (A and B) agree about the local existence of the intercomm. The AGREE between the leaders will then propagate this knowledge to the remote group leader. At this point the two leaders have the correct answer, and then the MPI_Comm_agree at line 8 will finally distribute this answer on the two groups. This case works!
2. The failure happen after the MPI_Intercomm_create. One can think about this case as a false positive (because the intercomm exists on all alive processes), but the real question is if we are able to detect it and reach a consensus. The MPI_Comm_agree at line 3 will make the flag TRUE on both groups. Note that on one side the MPI_Comm_agree will return MPI_ERR_PROC_FAILED as required by the MPI standard, but we do not take in account the return code here.
Here is the reasoning leading to this form of the algorithm.
1. The MPI_Comm_agree on line 3 is a little tricky. We use the returned value of the flag (bitwise AND operation between all living processes), but we willingly choose to ignore the return value. What really matters is that all alive participants have the resulting intercomm, and this will be reflected in the value of flag after the call. All other cases are trapped either by partially existing intercomm, or by the REVOKE call. Moreover, it is extremely important that the resulting value of flag is computed through an agreement and not through a traditional collective communication (which lack the consistency factor).
2. For a similar reason the MPI_Comm_agree on line 9 cannot be replaced by a MPI_Broadcast, even as its meaning is to broadcast the information collected at the local root into the local group of processes. Without the use of MPI_Comm_agree here one cannot distinguish between every process still alive has an intercomm with known faults or only some participants have an intercomm (with faults). Thus, as the intercomm can be revoked only if all alive processes know about it, we have to enforce this consistent view by the use of the agreement.
While this inter-communicator creation looks like an extremely complicated and expensive matter, one should keep in mind that such communicator creation are not supposed to be in the critical path, so they might not be as prohibitive as one might think. If Everything Was Easy Nothing Would Be Worth It!
A new version of the ULFM specification based on the upcoming MPI 3.1 and the discussions going on at the MPI Forum Meeting in Chicago in December 2013 has been posted under the ULFM Specification item. Head to ULFM Specification for more info.
To clarify the difference between installation/setup and usage, the old Usage Guide has been moved to ULFM Setup and a new Usage Guide has been put in place to provide instruction and examples for using ULFM constructs in MPI code. For now, this example section provides the code outlined in the ULFM specification, but this will eventually be amended to include more complete and unique examples. You can find the both of these pages in the menu bar, under User Level Failure Mitigation.
The User Level Failure Mitigation team will be at this year’s Supercomputing conference in Salt Lake City. Come visit us at The University of Tennessee booth (#3010) to hear more about our work as well as lots of other interesting research going on at UTK.
We’ve now established a mailing list related to ULFM for all user questions, bug reports, etc. To send an email to the list, you will need to subscribe first by sending an email to email@example.com. Then you can send emails to firstname.lastname@example.org.