ULFM-1.1 Release

ULFM has reached the 1.1 milestone, a minor release, crushing few bugs identified by our users and developers. Focus has been toward improving stability, feature coverage for intercommunicators, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING. Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations. Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard blessed – extension- names MPIX_ERR_xxx. Support for Intercommunicators: Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators. MPI_COMM_REVOKE tested on intercommunicators. Disabled completely (.ompi_ignore) many untested components Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms. Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate. Bugfixes: SendRecv would not always report MPI_ERR_PROC_FAILED correctly. SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv. Revoked send operations are now always completed or remote cancelled and may not deadlock anymore. Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure. Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator. Get the source and happy hacking, Continue reading ULFM-1.1 Release

SC’15 tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures, and up to advanced users with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults. Get the slides part1, part2, and the examples More information about the tutorial can be found here. Enjoy our promotional video šŸ˜‰ See you all in Austin, TX !!!

Uniform Intercomm Creation

A question about uniformly creating an inter-communicator using MPI_Intercomm_create has been posted on the ULFM mailing list. Initially, I though it is an easy corner-case, that can be solved with few barriers and/or agreements. It turns out this issue is more complicated that initially expected, with few twists on the way. Let me detail our adventure toward writing a uniform intercomm creation function. Before moving further, let’s clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough info to start talking about (page 262 line 6): This call [MPI_Intercomm_create] creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag. In other words, if you provide two intra-communicators, a leader on each one and a bridge communicator where the leaders can talk together, you will be able to bind the two groups of processes corresponding to each of the intra-communicators into a inter-communicator. Neat! Graphically speaking this should look like So far so good, but what “uniformly” means? Based on some Continue reading Uniform Intercomm Creation

New Usage Guide

To clarify the difference between installation/setup and usage, the old Usage Guide has been moved to ULFM Setup and a new Usage Guide has been put in place to provide instruction and examples for using ULFM constructs in MPI code. For now, this example section provides the code outlined in the ULFM specification, but this will eventually beĀ amendedĀ to include more complete and unique examples. You can find the both of these pages in the menu bar, under User Level Failure Mitigation.