SC’17 Tutorial

When attending the tutorial please download the material used during the tutorial from the following links: Theoretical Session slides Practical Session slides Examples ULFM docker The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting and finally dealing with faults. It targets a wide scientific community, starting from scientists trying to understand the challenges of different types of failures and their potential impact on applications, and up to practitionners with prior experience with fault-related topics that want to get a more precise understanding of the available tools allowing them to efficiently deal with faults. The tutorial is divided in two parts, one addressing the theoretical aspect and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions introduces a mode detailed description of the ULFM extensions and a set of handon examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created an ULFM docker. Continue reading SC’17 Tutorial

ULFM 2.0rc Docker package

There are many ways to install ULFM. For large scale experiments or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small non-performance critical test, one might want to spend time on working on the concepts instead of installing. Thus, we provide a docker image for those who want to quickly test it’s capabilities. Using the Docker Image Install Docker Docker can be seen as a “lightweight” virtual machine. Docker is available for a wide range of systems (MacOS, Windows, Linux). You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. `yum install docker`, `apt-get docker-io`, `port install docker-io`, etc.) In a terminal, Run docker run hello-world to verify that the docker installation works. Load the pre-compiled ULFM Docker machine into your Docker installation docker pull abouteiller/mpi-ft-ulfm Source the docker aliases in a terminal, this will redirect the “make” and “mpirun” command in the local shell to execute in the Docker machine. alias make=’docker run -v $PWD:$PWD:V -w $PWD abouteiller/mpi-ft-ulfm make’ alias mpirun=’docker run -v $(pwd):$(pwd):V -w $(pwd) abouteiller/mpi-ft-ulfm mpirun –oversubscribe -mca btl tcp,self’ Run Continue reading ULFM 2.0rc Docker package

Try the Docker packaged ULFM fault tolerant MPI

To support the SC’16 Tutorial, we have designed a self contained Docker image. This packaged docker image contains everything you need to compile, and run the tutorial examples, in a contained sandbox. Docker can be seen as a lightweight virtual machine, running its own copy of an operating system, but without the heavy requirement of a full-blown hypervisor. We use this technology to package a very small Linux distribution containing gcc, mpicc, and mpirun, as needed to compile and run natively your fault tolerant MPI examples on your host Linux, Mac or Windows desktop, without the effort of compiling a production version of ULFM Open MPI on your own. Content: 1. A Docker Image with a precompiled version of ULFM Open MPI 1.1. 2. The tutorial hands-on example. 3. Various tests and benchmarks for resilient operations. 4. The sources for the ULFM Open MPI branch release 1.1. Using the Docker Image 1. Install Docker You can install Docker quickly, either by downloading one of the official builds from for MacOS and Windows, or by installing Docker from your Linux or MAcOS package manager (i.e. yum install docker, apt-get docker-io, brew/port install docker-io). Please refer to the Docker installation Continue reading Try the Docker packaged ULFM fault tolerant MPI

ULFM Specification update

A new version of the ULFM specification accounting for remarks and discussions going on at the MPI Forum Meetings in 2016 has been posted under the ULFM Specification item. This new update has very few semantic changes. It clarifies the failure behavior of MPI_COMM_SPAWN, and corrects the output values of error codes and status objects returned from functions completing in error. Head to ULFM Specification for more info.

Logarithmic Revoke Routine

Starting with ULFM-1.0, the implementation features a logarithmic revoke operation, with a logarithmically bound per-node communication degree. A paper presenting this implementation will be presented at EuroMPI’15. The purpose of the Revoke operation is the propagation of failure knowledge, and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented with a reliable broadcast over the scalable and failure resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency, and does not introduce system noise outside of failure recovery periods. Purpose of the Revoke Operation If the communication pattern of the application is complex, the occurrence of failures has the potential to deeply disturb the application and prevent an effective recovery from being implemented. Consider the example in the above figure: as long as no failure occurs, the processes are communicating in a point-to-point pattern (we decide to call plan A). Process Pk is waiting to receive a message from Pk-1, then sends a message to Pk+1 (when such processes exist). Let’s observe the effect of introducing a failure in plan A, and consider that P1 has failed. As only P2 communicates directly with P1, other processes do not Continue reading Logarithmic Revoke Routine

ANU presents PDE solver with ULFM at IPDPS

Mohsin Ali and Peter Strazdins presented their work on “Application Level Fault Recovery, Using Fault-Tolerant Open MPI in a PDE Solver”, during the IPDPS PDSEC workshop, last week. See the full slides for more details. This novel work joins the growing list of applications benefiting from ULFM to feature fault tolerance; more examples are presented in these applications slides. If you have worked on fault tolerant applications with ULFM, or are thinking about doing so, please contact us.

Preparing for June MPI Forum meeting

In preparation for the June MPI forum meeting, the specification has received some updates. The most prominent changes are: The exposed memory in an RMA window may be completely undefined after a failure has occured. MPI_Comm_agree now operates a binary AND on the flag argument. Examples have been corrected to use error classes, instead of error codes, when relevant. The latest version is available in the ULFM specification area

ULFM Specification update

A new version of the ULFM specification accounting for remarks and discussions going on at the MPI Forum Meeting in Chicago in December 2013 has been posted under the ULFM Specification item. This new update adds a new error code to separate process failure errors from non-impacted requests when they remain pending (MPI_ERR_PROC_FAILED_PENDING), and adds new examples. Head to ULFM Specification for more info.