Category Archives: User Level Failure Mitigation
Uniform Intercomm Creation
A question about uniformly creating an inter-communicator using MPI_Intercomm_create has been posted on the ULFM mailing list. Initially, I thought it was an easy corner case that could be solved with a few barriers and/or agreements. It turns out the issue is more complicated than initially expected, with a few twists along the way. Let me detail our adventure toward writing a uniform intercomm creation function. Before moving further, let’s clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough information to start the discussion (page 262, line 6): “This call [MPI_Intercomm_create] creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.” In other words, if you provide two intra-communicators, a leader in each one, and a bridge communicator through which the leaders can talk to each other, you will be able to bind the two groups of processes corresponding to the intra-communicators into an inter-communicator. Neat! Graphically speaking, this should look like [figure]. So far so good, but what does “uniformly” mean? Based on some …
ANU presents PDE solver with ULFM at IPDPS
Mohsin Ali and Peter Strazdins presented their work on “Application Level Fault Recovery, Using Fault-Tolerant Open MPI in a PDE Solver”, during the IPDPS PDSEC workshop, last week. See the full slides for more details. This novel work joins the growing list of applications benefiting from ULFM to feature fault tolerance; more examples are presented in these applications slides. If you have worked on fault tolerant applications with ULFM, or are thinking about doing so, please contact us.
Preparing for June MPI Forum meeting
In preparation for the June MPI Forum meeting, the specification has received some updates. The most prominent changes are: (1) the exposed memory in an RMA window may be completely undefined after a failure has occurred; (2) MPI_Comm_agree now performs a bitwise AND on the flag argument; (3) examples have been corrected to use error classes, instead of error codes, when relevant. The latest version is available in the ULFM specification area.
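To make the MPI_Comm_agree change concrete, here is a minimal sketch in plain C of the value each participant would see in the flag argument, assuming the agreed result is simply the bitwise AND of every contributed flag. The helper model_agree is hypothetical and exists only for illustration. It is not part of MPI or ULFM, and it models neither the communication nor the failure handling of the real call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an MPI or ULFM function: models the flag
 * value produced by an agreement, under the assumption that the result
 * is the bitwise AND of the flags contributed by all participants. */
static int model_agree(const int *flags, int count)
{
    assert(flags != NULL || count == 0);

    int result = ~0; /* all bits set: the identity element for AND */
    for (int i = 0; i < count; i++) {
        result &= flags[i]; /* a bit survives only if everyone set it */
    }
    return result;
}
```

Under this model, three participants contributing 0x7, 0x5 and 0xD agree on 0x5, since only bits 0 and 2 are set in all three inputs. An application can therefore pack several independent yes/no votes into one flag and settle them in a single agreement.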
March MPI Forum Meeting
A full presentation of ULFM was given to the MPI Forum during the March meeting in San Jose, where we received good feedback. We also presented a deck of slides summarizing users’ success stories with ULFM.
Slides with ULFM examples
Some new slides with ULFM examples are now available. ULFM-EXAMPLES_SLIDES-MPI-Dec13FORUM
ULFM Specification update
A new version of the ULFM specification, which accounts for the remarks and discussions from the MPI Forum meeting in Chicago in December 2013, has been posted under the ULFM Specification item. This update adds a new error code (MPI_ERR_PROC_FAILED_PENDING) to separate process failure errors from non-impacted requests that remain pending, and adds new examples. Head to ULFM Specification for more info.
ULFM Specification update
A new version of the ULFM specification, based on the upcoming MPI 3.1 and on the discussions at the MPI Forum meeting in Chicago in December 2013, has been posted under the ULFM Specification item. Head to ULFM Specification for more info.
Flyer for ULFM at SC’13
A flyer has been created for SC’13. It puts further emphasis on use cases and features updated graphs showcasing performance while failures are being digested by the system. Download the flyer.
ULFM Specification Update
A new version of the ULFM specification based on the upcoming MPI 3.1 has been posted under the ULFM Specification item. Head to ULFM Specification for more info.