ULFM Setup

Building

Step 0:
Since you are building from a source tarball, you will need to run autogen to generate the build system. You may need to install or update some support packages, as indicated by the script output.

Step 1:
Configure with at least the following options.

The following configure line is what I would generally suggest. It is a debug build, so if things go wrong it can be used to figure out what may have occurred.

I have also tested an optimized build with the following parameters:

Step 2:
Now build Open MPI by running make (you can add '-j4' to do a parallel build if you like).

Step 3:
Now install Open MPI by running make install (you can add '-j4' here as well if you like).

Step 4:
Add Open MPI to your environment (make sure there are no other MPI libraries in
your PATH or LD_LIBRARY_PATH):
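
A minimal sketch of this step for a Bourne-style shell follows. The install prefix shown is a placeholder, not a path from this guide: replace it with whatever prefix you passed to configure. The new directories are prepended so that this build is found before any other MPI installation.

```shell
# Hypothetical install prefix (an assumption): use the prefix you gave to configure.
OMPI_PREFIX="$HOME/local/openmpi-ulfm"

# Prepend the bin and lib directories so this build takes precedence.
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Report which mpicc the shell now resolves, if any.
command -v mpicc || echo "mpicc not found; check that $OMPI_PREFIX/bin exists"
```

If the path printed for mpicc is not under your install prefix, another MPI installation is probably still ahead of it in PATH.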


Running

Source Code Modifications:

You will need to add one additional header to get access to the new functions, just after the include for mpi.h:

For C/C++ this looks like:

For F77 this looks like:

For F90 this looks like:

All of the new functions (e.g., MPI_Comm_agree) are prefixed with OMPI_ instead of MPI_, so MPI_Comm_agree becomes OMPI_Comm_agree. This namespace change is required to differentiate between standard interfaces (i.e., MPI) and non-standard interfaces (i.e., OMPI). Note that this only applies to the new interfaces; existing standard interfaces (e.g., MPI_Send) keep their names.

For a complete list of available interfaces you can either look in your install directory under:

or in the source tree under:

Building:

You will build your MPI applications as normal (e.g., mpicc).

Running:

At runtime you will just need to add one additional flag to the command line (-am ft-enable-mpi). This command line option activates the fault tolerance features of Open MPI:
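
As a rough sketch, a full launch command might look like the following. Only the -am ft-enable-mpi flag comes from the text above; the application name ./my_app and the process count of 4 are placeholders chosen for illustration. The guard simply keeps the snippet harmless on a machine where mpirun or the binary is missing.

```shell
# Placeholders (assumptions): substitute your own binary and process count.
APP=./my_app
NP=4

# Launch with the fault tolerance features enabled via -am ft-enable-mpi.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    mpirun -np "$NP" -am ft-enable-mpi "$APP"
else
    echo "skipping: mpirun or $APP is not available"
fi
```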


Testing

The following example will help you test your environment.

In this example, an error handler is set on MPI_COMM_WORLD, which is then split into two sub-communicators. One sub-communicator uses the error handler and the other uses the default error handler, MPI_ERRORS_ARE_FATAL. All processes (except the one that calls MPI_Abort) wait for the appropriate number of calls to the error handler before calling MPI_Finalize. One process (namely the last rank in MPI_COMM_WORLD) calls MPI_Abort on the sub-communicator that is using the default error handler. The MPI_Abort call terminates all processes in that sub-communicator and notifies all other processes through the error handler registered on their communicator.

This example should run to completion and exit with a return value of 0.

Source:
abort_subcomm_fatal.c
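
One way to check the result is to compile the example, run it, and inspect the exit status, roughly as sketched below. This assumes abort_subcomm_fatal.c is in the current directory and that mpicc and mpirun come from the ULFM build; the process count of 4 is an assumption, since this guide does not state one.

```shell
SRC=abort_subcomm_fatal.c
BIN=./abort_subcomm_fatal
NP=4   # assumed process count; not specified by this guide

if command -v mpicc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # Build, then run with fault tolerance enabled; STATUS holds the exit
    # status of mpirun, or of mpicc if the compile step failed.
    mpicc -o "$BIN" "$SRC" &&
    mpirun -np "$NP" -am ft-enable-mpi "$BIN"
    STATUS=$?
    if [ "$STATUS" -eq 0 ]; then
        echo "PASS: example exited with status 0"
    else
        echo "FAIL: exit status $STATUS"
    fi
else
    STATUS=skipped
    echo "skipping: mpicc or $SRC is not available"
fi
```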

The following is an example of the expected output from a run of this example program:

If you would like other examples or run into any problems, let us know.