ULFM 4.0.2u1

The ICL ULFM team is happy to announce ULFM v4.0.2u1, a new implementation of the MPI extension for handling faults, in sync with the current Open MPI release (v4.0.2). Numerous new features have been added to both Open MPI and ULFM; this announcement focuses on the ULFM ones. For information about what is new in Open MPI 4.0.2, read the changelog.

This is a stability and upstream parity upgrade, moving ULFM from version 4.0.1 of Open MPI to the latest stable release (v4.0.2, October 2019 #cb5f4e737a, ulfm #0e249ca1). It improves stability and performance, and makes it easier to track future stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure; it must either succeed or raise an MPI error. This implementation provides the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM …

ULFM 2.1rc1 Open MPI v4.0.1 based

The ICL ULFM team is happy to announce ULFM v4.0.1ulfm2.1rc1, a new implementation of the MPI extension for handling faults, in sync with the current Open MPI release (v4.0.1). Numerous new features have been added to both Open MPI and ULFM; this announcement focuses on the ULFM ones. For information about what is new in Open MPI 4.0.1, read the changelog.

This is a stability and upstream parity upgrade, moving ULFM from an old unreleased version of Open MPI to the latest stable release (v4.0.1, May 2019 #b780667). It improves stability and performance, and makes it easier to track future stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure; it must either succeed or raise an MPI error. This implementation provides the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM …

ULFM-1.1 Release

ULFM has reached the 1.1 milestone, a minor release that squashes a few bugs identified by our users and developers. The focus has been on improving stability, extending feature coverage for intercommunicators, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

- Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per the newer specification revision. It is properly returned from point-to-point, non-blocking ANY_SOURCE operations.
- Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard-blessed extension names MPIX_ERR_xxx.
- Support for intercommunicators:
  - Support for the blocking version of the agreement, MPI_COMM_AGREE, on intercommunicators.
  - MPI_COMM_REVOKE tested on intercommunicators.
- Completely disabled (.ompi_ignore) many untested components.
- Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
- Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
- Bugfixes:
  - SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
  - SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
  - Revoked send operations are now always completed or remote cancelled and can no longer deadlock.
  - Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
  - Repeated calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Get the source and happy hacking, …
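To make the three error codes listed above more concrete, here is a minimal sketch of how an application might decide what to do for each of them. It is not taken from the ULFM sources: the `ft_error_class` and `ft_action` enums and the `ft_choose_action` function are hypothetical names introduced only for illustration, and the policy shown (acknowledge and keep waiting, revoke, then rebuild) is one common recovery pattern, assumed here rather than mandated by the release. A real program would compare the error class against the `MPIX_ERR_*` constants from `<mpi-ext.h>` instead of the stand-in enum.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the error classes named in the release notes.
 * A real ULFM program would use MPIX_ERR_PROC_FAILED,
 * MPIX_ERR_PROC_FAILED_PENDING and MPIX_ERR_REVOKED from <mpi-ext.h>. */
typedef enum {
    FT_SUCCESS,
    FT_PROC_FAILED,
    FT_PROC_FAILED_PENDING,
    FT_REVOKED
} ft_error_class;

/* Hypothetical recovery actions; the comments name the ULFM calls that an
 * application would typically issue for each action. */
typedef enum {
    FT_CONTINUE,             /* nothing to do */
    FT_ACK_AND_KEEP_WAITING, /* e.g. MPIX_Comm_failure_ack, then wait again */
    FT_REVOKE_COMM,          /* e.g. MPIX_Comm_revoke to inform other ranks */
    FT_REBUILD_COMM          /* e.g. MPIX_Comm_shrink to drop failed ranks */
} ft_action;

/* One possible policy (an assumption, not the only valid choice). */
ft_action ft_choose_action(ft_error_class err)
{
    switch (err) {
    case FT_PROC_FAILED_PENDING:
        /* A non-blocking ANY_SOURCE receive is still pending: the request
         * is not complete, so acknowledge the failure and keep waiting. */
        return FT_ACK_AND_KEEP_WAITING;
    case FT_PROC_FAILED:
        /* A peer is dead: revoke so every rank learns about the failure. */
        return FT_REVOKE_COMM;
    case FT_REVOKED:
        /* Someone already revoked the communicator: build a new one. */
        return FT_REBUILD_COMM;
    case FT_SUCCESS:
    default:
        return FT_CONTINUE;
    }
}

/* Human-readable name of an action, for logging. */
const char *ft_action_name(ft_action a)
{
    switch (a) {
    case FT_ACK_AND_KEEP_WAITING: return "ack-and-keep-waiting";
    case FT_REVOKE_COMM:          return "revoke";
    case FT_REBUILD_COMM:         return "rebuild";
    default:                      return "continue";
    }
}
```

For example, `ft_action_name(ft_choose_action(FT_REVOKED))` yields `"rebuild"`, which matches the bugfix note above: a rank that keeps seeing MPI_ERR_PROC_FAILED eventually sees MPI_ERR_REVOKED and can then move on to rebuilding the communicator.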

ULFM 1.0 Announced

The major 1.0 milestone has been reached for the User Level Failure Mitigation compliant fault tolerant MPI. We have focused on improving performance, both before and after the occurrence of failures. The list of new features includes:

- Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
- Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantics have changed.
- New algorithm to perform agreements, with a truly logarithmic complexity in the number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK. Meet us at SC’15 to learn more about the novel algorithm we designed!
- New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks. Meet us at EuroMPI’15 to learn more about the Revoke algorithm we designed!
- Improved support for our traditional network layer:
  - TCP: fully tested
  - SM: fully tested (with the exception of XPMEM, which remains unsupported)
- Added support for high-performance networks:
  - Open IB: reasonably tested
  - uGNI: reasonably tested
- The tuned collective module is now enabled by default (reasonably tested); expect a huge performance boost compared to the former basic default setting.
- Back-ported PBS/ALPS fixes from Open MPI …
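As a rough illustration of why a logarithmic agreement matters, the sketch below simulates the result an agreement produces (the bitwise AND of the flags contributed by live ranks) using pairwise combining rounds, and counts those rounds. This is a toy, single-process model and not the algorithm shipped in ULFM 1.0: the names `agree_result`, `simulate_agree` and `SIM_MAX_RANKS` are hypothetical, failed ranks are simply modeled as contributing the AND identity, and no real failure detection or messaging takes place.

```c
#include <stddef.h>

#define SIM_MAX_RANKS 1024 /* hypothetical upper bound for this toy model */

typedef struct {
    int flag;   /* bitwise AND of the flags of all live ranks */
    int rounds; /* pairwise combining rounds used, or -1 if n is too large */
} agree_result;

/* Simulate an agreement over n ranks. flags[i] is the contribution of rank i,
 * alive[i] is non-zero when rank i has not failed. Dead ranks contribute ~0,
 * the identity of bitwise AND, so they do not influence the result. */
agree_result simulate_agree(const int *flags, const int *alive, size_t n)
{
    agree_result r = { ~0, 0 };
    int vals[SIM_MAX_RANKS];
    size_t i, stride;

    if (n == 0)
        return r;
    if (n > SIM_MAX_RANKS) {
        r.rounds = -1;
        return r;
    }

    for (i = 0; i < n; i++)
        vals[i] = alive[i] ? flags[i] : ~0;

    /* Each round halves the number of partial results still in play,
     * so the loop runs ceil(log2(n)) times. */
    for (stride = 1; stride < n; stride *= 2) {
        for (i = 0; i + stride < n; i += 2 * stride)
            vals[i] &= vals[i + stride];
        r.rounds++;
    }

    r.flag = vals[0];
    return r;
}
```

With eight ranks contributing `{7, 7, 5, 7, 6, 7, 7, 7}` and rank 4 marked as failed, the simulated agreement returns flag 5 after 3 rounds; doubling the rank count to sixteen would add only one more round, which is the kind of scaling the release notes refer to.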

ULFM Beta 2

The second beta for the ULFM implementation in Open MPI has been posted. This is a minor update to fix agreement operations. The changelog is relatively small and can be found in the Release Notes section of ULFM. The tarballs can be found in the Downloads section and instructions for use can be found in the Usage Guide.

ULFM Beta 1

The first beta for the ULFM implementation in Open MPI has been posted. This is the first public release of the User Level Failure Mitigation implementation. The changelog can be found in the Release Notes section of ULFM. The tarballs can be found in the Downloads section and instructions for use can be found in the Usage Guide.