George Bosilca

ULFM-1.1 Release

Posted on November 14, 2015 by George Bosilca

ULFM has reached the 1.1 milestone, a minor release that squashes a few bugs identified by our users and developers. The focus has been on improving stability, extending feature coverage for intercommunicators, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

- Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per the newer specification revision. It is properly returned from point-to-point, non-blocking ANY_SOURCE operations.
- MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED are aliased to the corresponding standard-blessed extension names, MPIX_ERR_xxx.
- Support for intercommunicators:
  - Support for the blocking version of the agreement, MPI_COMM_AGREE, on intercommunicators.
  - MPI_COMM_REVOKE tested on intercommunicators.
- Many untested components are now completely disabled (.ompi_ignore).
- Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
- Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.

Bugfixes:

- SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
- SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
- Revoked send operations are now always completed or remotely cancelled and may not deadlock anymore.
- Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
- Repeated calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Get the source and happy hacking.

Continue reading ULFM-1.1 Release→

SC’15 tutorial

Posted on October 2, 2015 by George Bosilca

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year's tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures to advanced users with prior experience in fault-related topics who want a more precise understanding of the tools available for dealing with faults efficiently.

Get the slides part1, part2, and the examples. More information about the tutorial can be found here. Enjoy our promotional video 😉

See you all in Austin, TX!

Tutorial @ SC'14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE

Posted on November 15, 2014 by George Bosilca

Continue reading Tutorial @ SC'14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE→

Uniform Intercomm Creation

Posted on July 10, 2014 by George Bosilca

A question about uniformly creating an inter-communicator using MPI_Intercomm_create has been posted on the ULFM mailing list. Initially, I thought it was an easy corner case that could be solved with a few barriers and/or agreements. It turns out this issue is more complicated than initially expected, with a few twists on the way. Let me detail our adventure toward writing a uniform intercomm creation function.

Before moving further, let’s clarify what MPI_Intercomm_create is about. The MPI standard is not very explicit about the scope of this function, but we can gather enough info to start talking about it (page 262 line 6):

“This call [MPI_Intercomm_create] creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.”

In other words, if you provide two intra-communicators, a leader on each one, and a bridge communicator where the leaders can talk to each other, you will be able to bind the two groups of processes corresponding to each of the intra-communicators into an inter-communicator. Neat! Graphically speaking, this should look like: [figure]

So far so good, but what does “uniformly” mean? Based on some Continue reading Uniform Intercomm Creation→

ULFM Specification update

Posted on December 10, 2013 by George Bosilca

A new version of the ULFM specification based on the upcoming MPI 3.1 and the discussions going on at the MPI Forum Meeting in Chicago in December 2013 has been posted under the ULFM Specification item. Head to ULFM Specification for more info.

ULFM Specification Update

Posted on August 15, 2013 by George Bosilca

A new version of the ULFM specification based on the upcoming MPI 3.1 has been posted under the ULFM Specification item. Head to ULFM Specification for more info.

ULFM Repository Change

Posted on January 2, 2013 by George Bosilca

To better accommodate management of the ULFM repository, the URL has changed. It can now be found at:

https://bitbucket.org/icldistcomp/ulfm/

All of your previous SSH keys will continue to work. All you need to do is change your .hg/hgrc file to point to the new repository.

New Usage Guide

Posted on November 27, 2012 by George Bosilca

To clarify the difference between installation/setup and usage, the old Usage Guide has been moved to ULFM Setup and a new Usage Guide has been put in place to provide instructions and examples for using ULFM constructs in MPI code. For now, this example section provides the code outlined in the ULFM specification, but this will eventually be amended to include more complete and unique examples. You can find both of these pages in the menu bar, under User Level Failure Mitigation.

ULFM at SC 12

Posted on November 8, 2012 by George Bosilca

The User Level Failure Mitigation team will be at this year’s Supercomputing conference in Salt Lake City. Come visit us at The University of Tennessee booth (#3010) to hear more about our work as well as lots of other interesting research going on at UTK.

ULFM Flyer

Posted on November 8, 2012 by George Bosilca

We have a flyer for User Level Failure Mitigation to show new results and design. It can be found at this link: SC12 ULFM Flyer.