Categories
User Level Failure Mitigation

ULFM-1.1 Release

ULFM has reached the 1.1 milestone, a minor release, crushing few bugs identified by our users and developers.

Focus has been toward improving stability, feature coverage for intercommunicators, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

  • Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per newer specification revision. Properly returned from point-to-point, non-blocking ANY_SOURCE operations.
  • Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard blessed – extension- names MPIX_ERR_xxx.
  • Support for Intercommunicators:
    • Support for the blocking version of the agreement, MPI_COMM_AGREE on Intercommunicators.
    • MPI_COMM_REVOKE tested on intercommunicators.
  • Disabled completely (.ompi_ignore) many untested components
  • Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
  • Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
  • Bugfixes:
    • SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
    • SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
  • Revoked send operations are now always completed or remote cancelled and may not deadlock anymore.
  • Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
  • Repeat calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Get the source and happy hacking,
The ULFM team