User Level Failure Mitigation – Fault Tolerance Research Hub

SC’23 Tutorial

Posted on August 2, 2023 by George Bosilca

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’23. This year, the tutorial will be held as a single, full-day session on Sunday, November 12th. Before attending, please download the following material; you will need it to follow the theoretical and practical parts of the tutorial.

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience of fault-related topics who want a more precise understanding of the tools available to deal with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction (1, 2) covers existing resilience approaches, including modeling solutions such as checkpoint/restart (C/R), buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions (3, 4) give a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during …

ULFM Specification update

Posted on October 31, 2022 by Aurelien Bouteiller

A new version of the ULFM specification, accounting for remarks and discussions at the MPI Forum meetings in September 2022, has been posted under the ULFM Specification item. This update has significant API changes, including a new API pair to control ANY-SOURCE messages (MPI_COMM_GET_FAILED/ACK_FAILED) and the introduction of implicit actions controlled by info keys that users can set on communicators (as discussed in our research paper on the topic, FTXS’22 Paper: Implicit Actions and Non-blocking Failure Recovery with MPI). Head to ULFM Specification for more info.

SC’22 Tutorial

Posted on October 28, 2022 by Aurelien Bouteiller

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’22. This year, the tutorial will be held as a single, full-day session on Monday, November 14th. Before attending, please download the following material; you will need it to follow the theoretical and practical parts of the tutorial.

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience of fault-related topics who want a more precise understanding of the tools available to deal with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction (1, 2) covers existing resilience approaches, including modeling solutions such as checkpoint/restart (C/R), buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions (3, 4) give a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during …

FTXS’22 Paper: Implicit Actions and Non-blocking Failure Recovery with MPI

Posted on October 28, 2022 by Aurelien Bouteiller

Our paper, Implicit Actions and Non-blocking Failure Recovery with MPI, has been accepted at the FTXS’22 workshop. The paper presents recent evolutions of the ULFM specification that enable a more asynchronous, implicit style of triggering recovery actions, and that allow recovery of the application state to overlap with recovery of the MPI library state.

ULFM Docker Package

Posted on November 14, 2021 by Aurelien Bouteiller

There are many ways to install ULFM. For performance evaluation, large-scale experiments and platforms, you should follow the instructions in the Open MPI ULFM Readme. However, for a quick test, or for a small non-performance-critical test, you may prefer to spend your time on the concepts rather than on installation. We therefore provide a Docker image for those who want to quickly test its capabilities.

Using the Docker Image

1. Install Docker. Docker can be seen as a “lightweight” virtual machine. It is available for a wide range of systems (MacOS, Windows, Linux). You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing it from your Linux package manager (e.g., yum install docker, apt-get install docker-io, port install docker-io, etc.).
2. In a terminal, run docker run hello-world to verify that the Docker installation works.
3. Load the pre-compiled ULFM Docker machine into your Docker installation:
   docker pull abouteiller/mpi-ft-ulfm
4. Source the docker aliases in a terminal; this redirects the “make” and “mpirun” commands in the local shell to execute in the Docker machine:
   alias make='docker run -v $PWD:/sandbox abouteiller/mpi-ft-ulfm make'
   alias mpirun='docker run -v $PWD:/sandbox abouteiller/mpi-ft-ulfm mpirun --with-ft ulfm --map-by :oversubscribe --mca btl tcp,self'
5. Run some example …

SC’21 Tutorial

Posted on November 12, 2021 by George Bosilca

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’21. This year, the tutorial will be held as a single, full-day session on Sunday, November 14th. Before attending, please download the following material; you will need it to follow the theoretical and practical parts of the tutorial.

- Introduction and Methods for Fault Tolerance (slides, slides, slides)
- Performance Models (slides, slides, slides, slides)
- Handling Errors in MPI Applications (slides)
- Recovering MPI Application Examples (slides)
- ULFM Fault Tolerant MPI on your desktop (docker)

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience of fault-related topics who want a more precise understanding of the tools available to deal with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction (1, 2) covers existing resilience approaches, including modeling solutions such as checkpoint/restart (C/R), …

SC’20 Tutorial

Posted on October 24, 2020 by Aurelien Bouteiller

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’20. This year, the tutorial will be split into two sessions over two days. Before attending, please download the following material; you will need it to follow the theoretical and practical parts of the tutorial.

Monday the 9th material

- Introduction and Methods for Fault Tolerance (slides)
- Failures and Silent Errors (slides)
- Checkpointing: the Young/Daly formula (slides)
- In-memory & Hierarchical Checkpointing, Replication (slides)
- Silent Errors (slides)

Tuesday the 10th material

- Handling Errors in MPI Applications (slides)
- Recovering MPI Application Examples (slides)
- ULFM Fault Tolerant MPI on your desktop (docker)

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience of fault-related topics who want a more precise understanding of the tools available to deal with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the …
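For readers who have not met the Young/Daly formula listed in the Monday material, here is a minimal sketch of its usual first-order form: the checkpoint period that minimizes expected waste is sqrt(2 × C × μ), where C is the time to write one checkpoint and μ is the platform's mean time between failures. This is not taken from the tutorial slides, which may use a more refined model. The function names and the numeric values are hypothetical, and the waste model ignores downtime and recovery time.

```python
import math


def young_daly_period(checkpoint_cost, mtbf):
    """First-order optimal checkpoint period: sqrt(2 * C * mu)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_waste(period, checkpoint_cost, mtbf):
    """First-order fraction of time lost for a given period T.

    C / T is the checkpointing overhead; T / (2 * mu) is the expected
    work redone after a failure (half a period lost, on average).
    Assumption: downtime and recovery cost are ignored.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    C = 60.0       # hypothetical: seconds to write one checkpoint
    MU = 86400.0   # hypothetical: platform MTBF in seconds (one day)
    t_opt = young_daly_period(C, MU)
    waste = expected_waste(t_opt, C, MU)
    print(f"checkpoint every {t_opt:.0f} s, expected waste {waste:.1%}")
```

Checkpointing more often than this period spends more time writing checkpoints; checkpointing less often loses more work per failure. The formula balances the two terms, which are equal at the optimum.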

Spurious Errors: lack of MPI Progress and Failure Detection

Posted on January 21, 2020 by Aurelien Bouteiller

A very common mishap while developing parallel applications is to assume that an application running at small scale automatically translates into successful large-scale runs. Such optimistic views are largely unproven in general, but tools exist to help with the validation process. However, adding resilience to a parallel application tends to increase the likelihood of consistency errors, and unfortunately no tools currently exist to help through this process. In some cases, a few common-sense practices can save hours of debugging and improve the quality of the parallel application.

As you work on your fault-tolerant MPI application, you discover that it runs fine at small scales and on small data sets, but when you increase the number of processes or the computational load on the participating processes, spurious faults seem to be ‘injected’ for no good reason. It might be easy to blame the underlying libraries, but before we go there, it is possible that you are observing an interoperability issue between the different layers of the resilient software stack. More precisely, you may be observing the effect of the lack of MPI progress on the failure detector within the MPI library.

MPI Progress (and lack thereof)

The MPI …

ULFM 4.0.2u1

Posted on November 18, 2019 by Aurelien Bouteiller

The ICL ULFM team is happy to announce ULFM v4.0.2u1, a new implementation of the MPI extension for handling faults, in sync with the current Open MPI release (v4.0.2). Numerous new features have been added both to Open MPI and to ULFM; in this announcement we focus on the ULFM ones. For information about what is new in Open MPI 4.0.2, read the changelog.

This is a stability and upstream parity upgrade, moving ULFM from version 4.0.1 of Open MPI to the latest stable release (v4.0.2, October 2019 #cb5f4e737a, ulfm #0e249ca1). It improves stability and performance, and facilitates future release tracking with the stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure; it must either succeed or raise an MPI error. This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the ULFM …

SC’19 Tutorial

Posted on November 16, 2019 by George Bosilca

Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts of the tutorial.

- Theoretical Session slides
- Practical Session slides
- Examples
- ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’19. The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, to practitioners with prior experience of fault-related topics who want a more precise understanding of the tools available to deal with faults efficiently.

The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers existing resilience approaches, including modeling solutions such as checkpoint/restart (C/R), buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions give a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM Docker image. More …