Try the Docker packaged ULFM fault tolerant MPI

To support the SC’16 tutorial, we have designed a self-contained Docker image. This packaged image contains everything you need to compile and run the tutorial examples in a contained sandbox. Docker can be seen as a lightweight virtual machine: it runs its own copy of an operating system, but without the heavy requirement of a full-blown hypervisor. We use this technology to package a very small Linux distribution containing gcc, mpicc, and mpirun, as needed to compile and run your fault-tolerant MPI examples natively on your Linux, Mac, or Windows desktop, without the effort of compiling a production version of ULFM Open MPI on your own.

Content:
1. A Docker image with a precompiled version of ULFM Open MPI 1.1.
2. The tutorial hands-on examples.
3. Various tests and benchmarks for resilient operations.
4. The sources for the ULFM Open MPI branch release 1.1.

Using the Docker image:
1. Install Docker. You can install Docker quickly, either by downloading one of the official builds from http://docker.io for MacOS and Windows, or by installing Docker from your Linux or MacOS package manager (e.g., yum install docker, apt-get install docker-io, brew/port install docker-io). Please refer to the Docker installation Continue reading Try the Docker packaged ULFM fault tolerant MPI
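A typical session with such an image looks like the following sketch. The image name `ulfm/tutorial` and the example file name are placeholders for illustration, not the actual published names; substitute the names given in the tutorial handouts.

```shell
# Fetch the tutorial image (placeholder name)
docker pull ulfm/tutorial

# Start a sandboxed shell, mounting the current directory
# (containing your examples) into the container
docker run -it --rm -v "$PWD":/sandbox -w /sandbox ulfm/tutorial bash

# Inside the container: compile and run an example against the
# prepackaged ULFM Open MPI
mpicc -o my_ft_example my_ft_example.c
mpirun -np 4 ./my_ft_example
```

Because the compiler and runtime live inside the container, nothing needs to be installed on the host besides Docker itself.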

ULFM Specification update

A new version of the ULFM specification, accounting for remarks and discussions from the 2016 MPI Forum meetings, has been posted under the ULFM Specification item. This update has very few semantic changes. It clarifies the failure behavior of MPI_COMM_SPAWN, and corrects the output values of error codes and status objects returned from functions completing in error. Head to ULFM Specification for more info.

SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year’s tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures, up to advanced users with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts: one theoretical (covering the different existing approaches and their modeling), and one practical. The slides for both parts are available (theory and practice), as well as the hands-on examples. Unlike previous years, we have embraced new technologies to facilitate public interaction with ULFM: enjoy the ULFM Docker. More information about the tutorial can be found here. Enjoy our promotional video 😉 See you all in Salt Lake City, UT!!!

ULFM-1.1 Release

ULFM has reached the 1.1 milestone, a minor release crushing a few bugs identified by our users and developers. The focus has been on improving stability, feature coverage for intercommunicators, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

- Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per the newer specification revision. It is properly returned from point-to-point, non-blocking ANY_SOURCE operations.
- Aliased MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING, and MPI_ERR_REVOKED to the corresponding standard-blessed extension names MPIX_ERR_xxx.
- Support for intercommunicators:
  - Support for the blocking version of the agreement, MPI_COMM_AGREE, on intercommunicators.
  - MPI_COMM_REVOKE tested on intercommunicators.
- Completely disabled (.ompi_ignore) many untested components.
- Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
- Added an OMPI-internal failure propagator; failure propagation between SM domains is now immediate.
- Bugfixes:
  - SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
  - SendRecv could incorrectly update the status with errors pertaining to the Send portion of the SendRecv.
  - Revoked send operations are now always completed or remotely cancelled, and may not deadlock anymore.
  - Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
  - Repeated calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Get the source and happy hacking! Continue reading ULFM-1.1 Release
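The MPI_ERR_PROC_FAILED_PENDING semantic above can be sketched as follows. This is a minimal illustration, assuming a ULFM-enabled Open MPI build providing `<mpi-ext.h>`; the helper function name is ours, not part of the API.

```c
#include <mpi.h>
#include <mpi-ext.h>  /* ULFM extensions: MPIX_ERR_*, MPIX_Comm_failure_ack */

/* Sketch: waiting on a non-blocking ANY_SOURCE receive under ULFM.
 * MPI_Wait may raise MPIX_ERR_PROC_FAILED_PENDING: the request is not
 * completed, but a failed process could have been the intended sender.
 * Acknowledging the known failures lets the wait resume, serving only
 * the surviving senders. */
static int wait_any_source(MPI_Comm comm, MPI_Request *req, MPI_Status *status)
{
    int rc, eclass;
    for (;;) {
        rc = MPI_Wait(req, status);
        if (MPI_SUCCESS == rc) return rc;
        MPI_Error_class(rc, &eclass);
        if (MPIX_ERR_PROC_FAILED_PENDING != eclass)
            return rc;  /* PROC_FAILED, REVOKED, ...: let the caller recover */
        MPIX_Comm_failure_ack(comm);  /* acknowledge failures, then retry */
    }
}
```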

SC’15 tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year’s tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures, up to advanced users with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. Get the slides (part 1, part 2) and the examples. More information about the tutorial can be found here. Enjoy our promotional video 😉 See you all in Austin, TX!!!

Logarithmic Revoke Routine

Starting with ULFM-1.0, the implementation features a logarithmic Revoke operation, with a logarithmically bound per-node communication degree. A paper presenting this implementation will be presented at EuroMPI’15. The purpose of the Revoke operation is the propagation of failure knowledge and the interruption of ongoing, pending communication, under the control of the user. We explain that the Revoke operation can be implemented as a reliable broadcast over the scalable and failure-resilient Binomial Graph (BMG) overlay network. Evaluation at scale, on a Cray XC30 supercomputer, demonstrates that the Revoke operation has a small latency and does not introduce system noise outside of failure recovery periods.

Purpose of the Revoke Operation

If the communication pattern of the application is complex, the occurrence of failures has the potential to deeply disturb the application and prevent an effective recovery from being implemented. Consider the example in the above figure: as long as no failure occurs, the processes communicate in a point-to-point pattern (which we call plan A). Process Pk waits to receive a message from Pk-1, then sends a message to Pk+1 (when such processes exist). Let’s observe the effect of introducing a failure in plan A, and consider that P1 has failed. As only P2 communicates directly with P1, other processes do not Continue reading Logarithmic Revoke Routine
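The plan A pattern and the user-controlled interruption described above can be sketched as follows. This is an illustrative skeleton, assuming a ULFM-enabled MPI with `<mpi-ext.h>`; the function name and integer payload are ours.

```c
#include <mpi.h>
#include <mpi-ext.h>  /* ULFM extension: MPIX_Comm_revoke */

/* Sketch of "plan A": P_k receives from P_{k-1}, then sends to P_{k+1}.
 * If any communication fails, the process revokes the communicator so
 * that all other processes, including those that never communicated
 * with the failed one, are interrupted and can switch to a recovery
 * plan. */
static int plan_a_step(MPI_Comm comm, int payload)
{
    int rank, size, rc = MPI_SUCCESS;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank > 0)
        rc = MPI_Recv(&payload, 1, MPI_INT, rank - 1, 0, comm,
                      MPI_STATUS_IGNORE);
    if (MPI_SUCCESS == rc && rank < size - 1)
        rc = MPI_Send(&payload, 1, MPI_INT, rank + 1, 0, comm);

    if (MPI_SUCCESS != rc) {
        /* Propagate failure knowledge: pending and future operations on
         * comm now raise MPIX_ERR_REVOKED everywhere. */
        MPIX_Comm_revoke(comm);
    }
    return rc;
}
```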

Logarithmic Agreement Routine

Starting with ULFM-1.0, the implementation features a purely logarithmic agreement, with no single point of failure. A paper presenting this implementation will be presented during SC’15. We considered a practical agreement algorithm with the following desired properties:
- the unique decided value is the result of a combination of all values proposed by deciding processes (a major difference with a 1-set agreement);
- failures consist of permanent crashes in a pseudo-synchronous system (no data corruption, loss of messages, or malicious behaviors are considered);
- the agreement favors failure-free performance over the failure case, striving to exchange a logarithmic number of messages in the absence of failures.

To satisfy this last requirement, we introduced a practical, intermediate property called Early Returning: the capacity of an early deciding algorithm to return before the stopping condition (early or not) is guaranteed. As soon as a process can determine that the decision value is fixed (except if it fails itself), the process is allowed to return. However, because the process is allowed to return early, later failures may compel that process to participate in additional communications. Therefore, the decision must remain available after the processes return, in order to serve unexpected message exchanges until the stopping condition can be established. Unlike a regular early stopping algorithm, not all processes decide and Continue reading Logarithmic Agreement Routine
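From the user's side, the agreement is exposed as MPI_COMM_AGREE (MPIX_Comm_agree in current ULFM builds). The following is a minimal usage sketch, assuming a ULFM-enabled MPI; the wrapper function name is ours.

```c
#include <mpi.h>
#include <mpi-ext.h>  /* ULFM extension: MPIX_Comm_agree */

/* Sketch: decide collectively, despite failures, whether an iteration
 * succeeded everywhere. MPIX_Comm_agree combines the flag contributed
 * by all participating processes (a bitwise AND) and delivers the same
 * decided value to every process that returns, which is exactly the
 * "unique decided value combining all proposed values" property. */
static int iteration_succeeded_everywhere(MPI_Comm comm, int local_ok)
{
    int flag = local_ok ? 1 : 0;
    /* Even if the agreement itself reports a process failure, the
     * decided value of flag is uniform among the survivors. */
    MPIX_Comm_agree(comm, &flag);
    return flag;
}
```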

ULFM 1.0 Announced

The major 1.0 milestone has been reached for the User Level Failure Mitigation compliant fault tolerant MPI. We have focused on improving performance, both before and after the occurrence of failures. The list of new features includes:
- Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
- Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantic has changed.
- New algorithm to perform agreements, with a truly logarithmic complexity in the number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK. Meet us at SC’15 to learn more about the novel algorithm we designed!
- New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks. Meet us at EuroMPI’15 to learn more about the Revoke algorithm we designed!
- Improved support for our traditional network layers:
  - TCP: fully tested
  - SM: fully tested (with the exception of XPMEM, which remains unsupported)
- Added support for high-performance networks:
  - Open IB: reasonably tested
  - uGNI: reasonably tested
- The tuned collective module is now enabled by default (reasonably tested); expect a huge performance boost compared to the former basic default setting.
- Back-ported PBS/ALPS fixes from Open MPI. Continue reading ULFM 1.0 Announced
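The operations listed above typically combine into a recovery path like the following sketch, assuming a ULFM-enabled MPI with `<mpi-ext.h>`; the function name is ours, and error handling is abbreviated.

```c
#include <mpi.h>
#include <mpi-ext.h>  /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */

/* Sketch: after detecting an error, revoke the communicator to
 * interrupt all pending communication, then shrink it to obtain a
 * working communicator containing only the surviving ranks. */
static int recover(MPI_Comm *comm)
{
    MPI_Comm newcomm;
    int rc;
    MPIX_Comm_revoke(*comm);                 /* interrupt plan A everywhere */
    rc = MPIX_Comm_shrink(*comm, &newcomm);  /* collective over survivors */
    if (MPI_SUCCESS != rc) return rc;
    MPI_Comm_free(comm);                     /* release the revoked comm */
    *comm = newcomm;
    return MPI_SUCCESS;
}
```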