ulfm – Page 2 – Fault Tolerance Research Hub

Simplifying the ACK/GET_ACKED couple

Posted on August 26, 2019 by Aurelien Bouteiller

In this post, we discuss two alternative proposals to simplify the MPIX_COMM_FAILURES_ACK and MPIX_COMM_FAILURES_GET_ACKED functions in the User Level Failure Mitigation (ULFM) MPI specification draft. These functions are particularly useful for users whose applications need to introspect about process failures and resume point-to-point communication capabilities independently from other MPI ranks. As such, they are instrumental in reducing the cost of recovery in applications with mostly decoupled communication patterns, such as master-worker, partial differential equation solvers, asynchronous methods, etc.

Dealing with wildcard receptions

These two functions serve two complementary purposes in the ULFM specification. The first purpose is to identify failed processes independently from other MPI ranks, while the second is to restore the ability to post receptions from MPI_ANY_SOURCE. Consider the code in Example 1. If any process has failed, the reception must be interrupted, as it otherwise risks deadlocking if the actual sender has crashed. The crux of the issue then becomes how to restore the capability of using wildcard receptions. In single-threaded programs, multiple simple options would be available. For example, the implementation may report a process failure only once and then automatically re-enable wildcard receptions. This approach, however, becomes rife with race conditions in …

Continue reading Simplifying the ACK/GET_ACKED couple→
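To make the wildcard-reception issue concrete, here is a minimal toy model in Python. It is not MPI code and not the ULFM implementation: `ToyComm`, `ProcFailedError`, and every method name are hypothetical, chosen only to mirror the roles of the ACK and GET_ACKED functions described in the excerpt. It sketches one rank's local view, assuming that a wildcard receive is interrupted while any detected failure remains unacknowledged.

```python
class ProcFailedError(Exception):
    """Hypothetical stand-in for the MPI error raised on a process failure."""


class ToyComm:
    """Toy, single-process model of one rank's view of a communicator."""

    def __init__(self):
        self.failed = set()  # ranks this process has detected as failed
        self.acked = set()   # failures the application has acknowledged
        self.inbox = []      # (source, payload) messages waiting to be received

    def detect_failure(self, rank):
        # Simulates the local failure detector noticing a dead rank.
        self.failed.add(rank)

    def deliver(self, source, payload):
        # Simulates a message arriving from a live rank.
        self.inbox.append((source, payload))

    def failure_ack(self):
        # Plays the ACK role: acknowledge every failure detected so far.
        self.acked = set(self.failed)

    def failure_get_acked(self):
        # Plays the GET_ACKED role: list the acknowledged failed ranks.
        return sorted(self.acked)

    def recv_any_source(self):
        # Wildcard receive: interrupted while any failure is unacknowledged,
        # because the expected sender might be among the dead ranks.
        if self.failed - self.acked:
            raise ProcFailedError("unacknowledged process failure")
        # A real receive would block here; the toy returns None instead.
        return self.inbox.pop(0) if self.inbox else None


comm = ToyComm()
comm.deliver(2, "result")
comm.detect_failure(1)

try:
    comm.recv_any_source()
    interrupted = False
except ProcFailedError:
    interrupted = True

comm.failure_ack()                  # wildcard receptions are re-enabled
acked = comm.failure_get_acked()    # the application learns who failed
message = comm.recv_any_source()
print(interrupted, acked, message)
```

In this model the acknowledgement is an explicit application action, so a failure detected after the acknowledgement interrupts wildcard receives again until it is acknowledged in turn.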

ULFM 2.1rc1 Open MPI v4.0.1 based

Posted on August 12, 2019 by Aurelien Bouteiller

The ICL ULFM team is happy to announce ULFM v4.0.1ulfm2.1rc1, a new implementation of the MPI extension handling faults, in sync with the current Open MPI release (v4.0.1). Many new features have been added to both Open MPI and ULFM; in this announcement we focus on the ULFM ones. For information about what is new in Open MPI 4.0.1, read the changelog. This is a stability and upstream parity upgrade, moving ULFM from an old unreleased version of Open MPI to the latest stable release (v4.0.1, May 2019 #b780667). It improves stability and performance, and facilitates future release tracking with the stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error. This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the …

Continue reading ULFM 2.1rc1 Open MPI v4.0.1 based→

SC’18 Tutorial

Posted on November 9, 2018 by George Bosilca

Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts: Theoretical Session slides, Practical Session slides, Examples, ULFM docker.

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’18 (somewhat similar to past incarnations). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools allowing them to deal with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing, and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we …

Continue reading SC’18 Tutorial→

ULFM 2.1a1 Docker Package

Posted on November 8, 2018 by Aurelien Bouteiller

There are many ways to install ULFM. For performance evaluation, large-scale experiments, and large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small non-performance-critical test, one might want to spend time working on the concepts instead of installing. Thus, we provide a Docker image for those who want to quickly test its capabilities. The 2.1a1 Docker image is a bugfix release compared to the previous 2.0rc version.

Using the Docker Image

1. Install Docker. Docker can be seen as a “lightweight” virtual machine. Docker is available for a wide range of systems (MacOS, Windows, Linux). You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. `yum install docker`, `apt-get install docker.io`, `port install docker-io`, etc.).
2. In a terminal, run `docker run hello-world` to verify that the Docker installation works.
3. Load the pre-compiled ULFM Docker machine into your Docker installation: `docker pull abouteiller/mpi-ft-ulfm`
4. Source the docker aliases in a terminal; this will redirect the “make” and “mpirun” commands in the local shell to execute in the Docker machine. `alias make='docker run -v $PWD:/sandbox:Z abouteiller/mpi-ft-ulfm make'` `alias mpirun='docker run -v` …

Continue reading ULFM 2.1a1 Docker Package→

EuroMPI’18 tutorial

Posted on September 23, 2018 by George Bosilca

Following the success of the first joint tutorial with the VeloC team, we decided to follow up with a second incarnation of this mixed tutorial at EuroMPI’18. Bogdan Nicolae, Franck Cappello and George Bosilca will present this tutorial, titled Resilience in parallel applications. The tutorial will present two complementary fault management techniques to empower application developers to deal with various types of failures directly at the application level, increasing the opportunities to reduce the resilience overhead with holistic support from all layers: hardware and software, as well as the parallel programming paradigm. The tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: application-defined checkpoint-restart (as demonstrated through the VeloC runtime), and user-level failure mitigation (as demonstrated through the ULFM extension to the MPI standard). The tutorial will use the following decks of slides: Introduction, VeloC and ULFM, as well as a set of examples for VeloC and ULFM. For the hands-on sessions, participants are expected to bring their own laptop, running Windows, Linux or Mac OS X, with Docker installed.

Using the Docker Image

Install Docker: Docker can be seen as a “lightweight” virtual …

Continue reading EuroMPI’18 tutorial→

EuroPar’18 tutorial

Posted on August 27, 2018 by George Bosilca

The ULFM team is happy to announce that a joint tutorial on resilience with the VeloC team has been accepted at EuroPar’18. Bogdan Nicolae, Franck Cappello and George Bosilca will present this tutorial, titled Application-driven Fault-Tolerance for High Performance Distributed Computing. The tutorial will focus on a few approaches to empower application developers to deal with different types of failures at the application level, increasing the opportunities to reduce the resilience overhead with holistic support from all layers: hardware and software, as well as the parallel programming paradigm. The tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: application-defined checkpoint-restart (as demonstrated through the VeloC runtime), and user-level failure mitigation (as demonstrated through the ULFM extension to the MPI standard). The tutorial will use the following decks of slides: Introduction, VeloC and ULFM, as well as a set of examples for VeloC and ULFM. For the hands-on sessions, participants are expected to bring their own laptop, running Windows, Linux or Mac OS X, with Docker installed.

Using the Docker Image

Install Docker: Docker can be seen as a “lightweight” virtual machine, a perfect …

Continue reading EuroPar’18 tutorial→

SC’17 Tutorial

Posted on November 11, 2017 by Aurelien Bouteiller

When attending the tutorial, please download the material used during the tutorial from the following links: Theoretical Session slides, Practical Session slides, Examples, ULFM docker.

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year's tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools allowing them to deal with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing, and Algorithmic-Based Fault Tolerance (ABFT). The practical sessions introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM docker.

Continue reading SC’17 Tutorial→

ULFM 2.0rc Docker package

Posted on November 10, 2017 by Aurelien Bouteiller

There are many ways to install ULFM. For large-scale experiments or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small non-performance-critical test, one might want to spend time working on the concepts instead of installing. Thus, we provide a Docker image for those who want to quickly test its capabilities.

Using the Docker Image

1. Install Docker. Docker can be seen as a “lightweight” virtual machine. Docker is available for a wide range of systems (MacOS, Windows, Linux). You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. `yum install docker`, `apt-get install docker.io`, `port install docker-io`, etc.).
2. In a terminal, run `docker run hello-world` to verify that the Docker installation works.
3. Load the pre-compiled ULFM Docker machine into your Docker installation: `docker pull abouteiller/mpi-ft-ulfm`
4. Source the docker aliases in a terminal; this will redirect the “make” and “mpirun” commands in the local shell to execute in the Docker machine. `alias make='docker run -v $PWD:$PWD:V -w $PWD abouteiller/mpi-ft-ulfm make'` `alias mpirun='docker run -v $(pwd):$(pwd):V -w $(pwd) abouteiller/mpi-ft-ulfm mpirun --oversubscribe -mca btl tcp,self'`
5. Run …

Continue reading ULFM 2.0rc Docker package→

ULFM 2.0

Posted on November 3, 2017 by George Bosilca

ULFM 2.0 release Continue reading ULFM 2.0→

Running on Edison

Posted on March 26, 2017 by George Bosilca

ULFM configuration for NERSC Edison Continue reading Running on Edison→