Mohsin Ali and Peter Strazdins presented their work on “Application Level Fault Recovery, Using Fault-Tolerant Open MPI in a PDE Solver” at the IPDPS PDSEC workshop last week. See the full slides for more details. This novel work joins the growing list of applications that use ULFM to provide fault tolerance; more examples are presented in these applications slides. If you have worked on fault-tolerant applications with ULFM, or are thinking about doing so, please contact us.
To better accommodate management of the ULFM repository, its URL has changed. The repository is now located at https://bitbucket.org/icldistcomp/ulfm/. All of your previous SSH keys will continue to work; you only need to change your .hg/hgrc file to point to the new repository.
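As a rough sketch of that change, the [paths] section of the .hg/hgrc file in an existing local clone could look like the fragment below. The HTTPS URL is the one given above; the commented-out SSH form is an assumption based on the usual Bitbucket convention and may differ for your setup.

```ini
# .hg/hgrc inside your existing local clone
[paths]
# New repository location, as announced above.
default = https://bitbucket.org/icldistcomp/ulfm/
# Assumed SSH equivalent, if you normally push and pull over SSH:
# default = ssh://hg@bitbucket.org/icldistcomp/ulfm
```

After saving the file, running `hg paths` in the clone should list the new URL as the default path.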
To clarify the difference between installation/setup and usage, the old Usage Guide has been moved to ULFM Setup, and a new Usage Guide has been put in place to provide instructions and examples for using ULFM constructs in MPI code. For now, its example section provides the code outlined in the ULFM specification, but it will eventually be extended with more complete and original examples. You can find both of these pages in the menu bar, under User Level Failure Mitigation.
We’ve now established a ULFM mailing list for all user questions, bug reports, etc. To post to the list, you first need to subscribe by sending an email to firstname.lastname@example.org. You can then send emails to email@example.com.