ULFM Setup

Building

Step 0:
Because you are building from a source tarball, you first need to generate the build system with autogen.pl. The script output may tell you to install or update some support packages.

shell$ ./autogen.pl

Step 1:
Configure with at least the following options.

--enable-mpi-ext=ftmpi --with-ft=mpi

The following configure line is the one I generally suggest. It produces a debug build, which makes it easier to figure out what happened if something goes wrong.

shell$ ./configure --prefix=/path/to/install \
       --enable-mpi-ext=ftmpi --with-ft=mpi \
       --disable-io-romio --enable-contrib-no-build=vt \
       --with-devel-headers --enable-binaries \
       --enable-debug \
       CC=gcc CXX=g++ F77=gfortran FC=gfortran

I have also tested an optimized build with the following parameters:

shell$ ./configure --prefix=/path/to/install \
       --enable-mpi-ext=ftmpi --with-ft=mpi \
       --disable-io-romio --enable-contrib-no-build=vt \
       --with-platform=optimized \
       CC=gcc CXX=g++ F77=gfortran FC=gfortran

Step 2:
Now build Open MPI (you can use '-j4' to do a parallel build if you like):

shell$ make

Step 3:
Now install Open MPI (you can use '-j4' to do a parallel install if you like):

shell$ make install

Step 4:
Add Open MPI to your environment, replacing $HOME/open-mpi-install-dir with the
--prefix directory you passed to configure (make sure there are no other MPI
libraries in your PATH or LD_LIBRARY_PATH):

shell$ export PATH=$HOME/open-mpi-install-dir/bin:$PATH
shell$ export LD_LIBRARY_PATH=$HOME/open-mpi-install-dir/lib:$LD_LIBRARY_PATH
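
Before moving on, it can help to confirm that the mpicc and mpirun found first in your PATH really come from this install. The following is a minimal sketch and not part of Open MPI: check_mpi_prefix is a hypothetical helper name, and it assumes the tools live under <prefix>/bin and that the prefix is given without a trailing slash.

```shell
# Hypothetical helper (not shipped with Open MPI): verify that the first
# mpicc and mpirun in PATH are located under the given install prefix.
check_mpi_prefix() {
    prefix=$1
    rc=0
    for tool in mpicc mpirun; do
        # command -v prints the path of the first match in PATH, if any.
        found=$(command -v "$tool" 2>/dev/null) || found=""
        case "$found" in
            "$prefix"/bin/*) echo "ok: $tool -> $found" ;;
            "")              echo "missing: $tool not found in PATH"; rc=1 ;;
            *)               echo "wrong: $tool -> $found"; rc=1 ;;
        esac
    done
    return "$rc"
}

# Example call (adjust the prefix to your own install directory):
#   check_mpi_prefix "$HOME/open-mpi-install-dir"
```

The function returns a non-zero status if either tool is missing or is picked up from a different MPI installation, which usually means another MPI library is still ahead of this one in your PATH.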

Running

Source Code Modifications:

You will need to add one additional header, just after the include for mpi.h, to get access to the new functions.

For C/C++ this looks like:

#include <mpi.h>
#include <mpi-ext.h>

For F77 this looks like:

program main
implicit none
include 'mpif.h'
include 'mpif-ext.h'

For F90 this looks like:

program main
    use mpi
!   Ideally you would use the module, but that is not ready yet
!   instead just use the F77 header below.
!   use mpi_ext

    implicit none
!   Use the F77 Header until the full F90 module is ready
    include 'mpif-ext.h'

All of the new functions are prefixed with OMPI instead of MPI. For example, MPI_Comm_agree becomes OMPI_Comm_agree. This namespace change is required to differentiate between standard interfaces (MPI) and non-standard interfaces (OMPI). Note that it applies only to the new interfaces; existing standard interfaces (e.g., MPI_Send) keep their names.

For a complete list of available interfaces, look either in your install directory under:

$INSTALL_ROOT/include/openmpi/ompi/mpiext/ftmpi/mpiext_ftmpi_c.h
$INSTALL_ROOT/include/openmpi/ompi/mpiext/ftmpi/mpiext_ftmpi_f77.h
$INSTALL_ROOT/include/openmpi/ompi/mpiext/ftmpi/mpiext_ftmpi_f90.h

or in the source tree under:

$SOURCE_ROOT/ompi/mpiext/ftmpi/mpiext_ftmpi_c.h
$SOURCE_ROOT/ompi/mpiext/ftmpi/mpiext_ftmpi_f77.h
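
To get a quick overview without reading a whole header, you can extract the OMPI-prefixed names from it. This is a rough sketch: list_ompi_names is a hypothetical helper, and the pattern matches every identifier that starts with OMPI_, so the output may include macros or constants as well as functions.

```shell
# Hypothetical helper: print each distinct OMPI_-prefixed identifier
# that appears in the header file given as the first argument.
list_ompi_names() {
    # grep -o prints only the matching text; sort -u removes duplicates.
    grep -o 'OMPI_[A-Za-z0-9_]*' "$1" | sort -u
}

# Example call, using the C header path from the list above:
#   list_ompi_names "$INSTALL_ROOT/include/openmpi/ompi/mpiext/ftmpi/mpiext_ftmpi_c.h"
```

The same call works on the F77 header or on the copies in the source tree.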

Building:

Build your MPI applications as you normally would, using the wrapper compilers (e.g., mpicc):

shell$ mpicc -g -Wall -o my-app my-app.c
shell$ mpif77 -g -Wall -o my-app my-app.f
shell$ mpif90 -g -Wall -o my-app my-app.f90

Running:

At runtime you need to add just one additional flag to the mpirun command line (-am ft-enable-mpi). This option activates the fault tolerance features of Open MPI:

shell$ mpirun -np 8 -am ft-enable-mpi ./my-app

Testing

The following example will help you test your environment.

In this example, an error handler is set on MPI_COMM_WORLD, which is then split into two sub-communicators. One sub-communicator uses the error handler; the other uses the default error handler, MPI_ERRORS_ARE_FATAL. One process (the last rank in MPI_COMM_WORLD) calls MPI_Abort on the sub-communicator that uses the default error handler. The MPI_Abort call terminates all processes in that sub-communicator and notifies all other processes through the error handler registered on the communicator. All processes except the one calling MPI_Abort wait for the appropriate number of calls to the error handler before calling MPI_Finalize.

This example should run to completion and exit with a return value of 0.

Source:
abort_subcomm_fatal.c

The following is an example of the expected output from a run of this example program:

shell$ mpicc -g -Wall -o abort_subcomm_fatal abort_subcomm_fatal.c

shell$ mpirun -np 8 -am ft-enable-mpi abort_subcomm_fatal
0 of 8: # Signals  (-s) =   4
 0 of  8) Waiting for 4 failures
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI COMMUNICATOR 3 SPLIT FROM 0 
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes
in this communicator. Depending on the use of error handlers in
this application, and how Open MPI was configured this may also
cause Open MPI to kill all MPI processes in the job.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
 0 of  8) Error Handler: (Comm = MCW)   4 Failed Ranks:     5,   4,   6,   7.
 1 of  8) Error Handler: (Comm = MCW)   4 Failed Ranks:     5,   4,   6,   7.
 3 of  8) Error Handler: (Comm = MCW)   4 Failed Ranks:     5,   4,   6,   7.
 2 of  8) Error Handler: (Comm = MCW)   4 Failed Ranks:     5,   4,   6,   7.
 0 of  8) MPI_Probe() Error: Some rank failed (error =  54)
 1 of  8) MPI_Probe() Error: Some rank failed (error =  54)
 2 of  8) MPI_Probe() Error: Some rank failed (error =  54)
 3 of  8) MPI_Probe() Error: Some rank failed (error =  54)
 0 of  8) Finalize

shell$ echo $?
0
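
If you run this test repeatedly, the exit-status check above can be wrapped in a small helper. This is only a sketch: expect_success is a hypothetical name, and it does nothing more than run the command it is given and report whether the exit status was 0.

```shell
# Hypothetical helper: run the given command, print PASS or FAIL with the
# exit status, and return that same status to the caller.
expect_success() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "PASS (exit status 0)"
    else
        echo "FAIL (exit status $rc)"
    fi
    return "$rc"
}

# Example call, using the command line from the run above:
#   expect_success mpirun -np 8 -am ft-enable-mpi abort_subcomm_fatal
```

The PASS or FAIL line is printed after the program's own output, so the messages from the error handlers remain visible.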

If you would like other examples or run into any problems, let us know.