Usage Guide

Basic Requirements

First and foremost, to use this ULFM implementation, an MPI application must change the default error handler on (at least) MPI_COMM_WORLD from MPI_ERRORS_ARE_FATAL to either MPI_ERRORS_RETURN or a custom MPI Errhandler.

Examples

Master/Worker

The example below presents a master code that handles failures by ignoring failed processes and resubmitting requests. It demonstrates the different failure cases that may occur when posting receptions from MPI_ANY_SOURCE as discussed in the advice to users in the proposal, Section 17.2.2.

master.c

Iterative Refinement

The example below demonstrates a method of fault-tolerance to detect and handle failures. At each iteration, the algorithm checks the return code of the MPI_ALLREDUCE. If the return code indicates a process failure for at least one process, the algorithm revokes the communicator, agrees on the presence of failures, and later shrinks it to create a new communicator. By calling MPI_COMM_REVOKE, the algorithm ensures that all processes will be notified of process failure and enter the MPI_COMM_AGREE. If a process fails, the algorithm must complete at least one more iteration to ensure a correct answer.

iterative_refinement.c


These examples will be expanded to complete, compilable code along with new examples soon.
