The major 1.0 milestone has been reached for the User Level Failure Mitigation (ULFM) compliant, fault-tolerant MPI.
We have focused on improving performance, both before and after the occurrence of failures. The list of new features includes:
- Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
- Compliance with the latest ULFM specification draft. In particular, the semantics of MPI_COMM_(I)AGREE have changed.
- New algorithm to perform agreements, with a truly logarithmic complexity in the number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK. Meet us at SC’15 to learn more about the novel algorithm we designed!
- New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks. Meet us at EuroMPI’15 to learn more about the Revoke algorithm we designed!
- Improved support for our traditional network layer:
  - TCP: fully tested
  - SM: fully tested (with the exception of XPMEM, which remains unsupported)
- Added support for High Performance networks
  - Open IB: reasonably tested
  - uGNI: reasonably tested
- The tuned collective module is now enabled by default (reasonably tested); expect a huge performance boost compared to the former basic default setting
- Back-ported PBS/ALPS fixes from Open MPI
- Back-ported OpenIB bug/performance fixes from Open MPI
- Improved Context ID allocation algorithm to reduce overheads of Shrink
- Miscellaneous bug fixes (look at the commit log for the full list).
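
To give a feel for why a broadcast with a fixed maximum output degree scales logarithmically with the number of ranks, here is a minimal simulation sketch. It is not the Revoke algorithm shipped in this release: it assumes a simple breadth-first numbered tree rooted at rank 0, counts one round per tree level, and ignores failures entirely (so it shows the scaling argument only, not the reliability part). The function names `parent`, `rounds_to_reach`, and `broadcast_rounds` are hypothetical and exist only for this illustration.

```python
def parent(rank, degree):
    """Parent of `rank` in a breadth-first numbered tree (assumed topology)
    where every rank forwards the message to at most `degree` children."""
    return (rank - 1) // degree


def rounds_to_reach(rank, degree):
    """Number of forwarding rounds before `rank` receives the message,
    assuming rank 0 starts the broadcast and each tree level costs one round."""
    rounds = 0
    while rank > 0:
        rank = parent(rank, degree)
        rounds += 1
    return rounds


def broadcast_rounds(nranks, degree):
    """Rounds needed to reach every rank. With breadth-first numbering the
    highest rank sits on the deepest level, so it is the last one reached."""
    if nranks <= 0:
        return 0
    return rounds_to_reach(nranks - 1, degree)


if __name__ == "__main__":
    # Round counts grow slowly while the rank count grows by orders of magnitude.
    for n in (8, 1024, 1048576):
        print(n, broadcast_rounds(n, 2), broadcast_rounds(n, 4))
```

In this toy model, each rank sends at most `degree` messages, so the per-rank cost stays bounded while the number of rounds grows with the logarithm of the rank count. A larger `degree` trades more messages per rank for fewer rounds.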
Fault tolerance support for RMA and IO is still under development.
Get the source and happy hacking,
The ULFM team