Software – Fault Tolerance Research Hub

ULFM Docker Package

Posted on November 14, 2021 by Aurelien Bouteiller

There are many ways to install ULFM. For performance evaluation, large-scale experiments and platforms, you should follow the instructions from the Open MPI ULFM Readme. However, for a quick test, or for a small non-performance-critical test, you may prefer to spend your time working on the concepts instead of installing. Thus, we provide a Docker image for those who want to quickly test its capabilities.

Using the Docker Image

Install Docker

Docker can be seen as a “lightweight” virtual machine. Docker is available for a wide range of systems (MacOS, Windows, Linux). You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get install docker.io, port install docker-io, etc.).

In a terminal, run docker run hello-world to verify that the Docker installation works.

Load the pre-compiled ULFM Docker machine into your Docker installation:

    docker pull abouteiller/mpi-ft-ulfm

Source the Docker aliases in a terminal; this redirects the “make” and “mpirun” commands in the local shell to execute in the Docker machine.

    alias make='docker run -v $PWD:/sandbox abouteiller/mpi-ft-ulfm make'
    alias mpirun='docker run -v $PWD:/sandbox abouteiller/mpi-ft-ulfm mpirun --with-ft ulfm --map-by :oversubscribe --mca btl tcp,self'

Run some example

Continue reading ULFM Docker Package→
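Shell aliases are not expanded in non-interactive scripts by default, so one option is to wrap the same two commands from the alias step in shell functions. This is a minimal sketch, not part of the original instructions. The function names ulfm_make and ulfm_mpirun and the DOCKER override variable are assumptions introduced for illustration. The image name, the /sandbox mount point and the mpirun options are taken from the aliases in the post.

```shell
#!/bin/sh
# Sketch only: wraps the two aliased commands from the post in shell functions.
# DOCKER is a hypothetical override (not in the original post). It defaults to
# the real docker CLI and exists so the wrappers can be dry-run, e.g. with echo.
: "${DOCKER:=docker}"

# Image name and /sandbox mount point are taken from the aliases in the post.
ULFM_IMAGE="abouteiller/mpi-ft-ulfm"

# Run make inside the container, with the current directory mounted at /sandbox.
ulfm_make() {
    "$DOCKER" run -v "$PWD:/sandbox" "$ULFM_IMAGE" make "$@"
}

# Run mpirun inside the container with the fault-tolerance options from the post.
ulfm_mpirun() {
    "$DOCKER" run -v "$PWD:/sandbox" "$ULFM_IMAGE" \
        mpirun --with-ft ulfm --map-by :oversubscribe --mca btl tcp,self "$@"
}
```

After sourcing this file, ulfm_make and ulfm_mpirun -np 4 ./my_program (a hypothetical binary name) run the same commands as the aliases. Setting DOCKER=echo prints the resulting command line instead of running it, which is a quick way to check the arguments before pulling the image.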

ULFM 4.0.2u1

Posted on November 18, 2019 by Aurelien Bouteiller

The ICL ULFM team is happy to announce ULFM v4.0.2u1, a new implementation of the MPI fault-handling extension, in sync with the current Open MPI release (v4.0.2). Many new features have been added to both Open MPI and ULFM; this announcement focuses on the ULFM ones. For information about what is new in Open MPI 4.0.2, read the changelog.

This is a stability and upstream parity upgrade, moving ULFM from version 4.0.1 of Open MPI to the latest stable (v4.0.2, October 2019 #cb5f4e737a, ulfm #0e249ca1). It improves stability and performance, and makes it easier to track future stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error.

This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the

Continue reading ULFM 4.0.2u1→

ULFM 2.1rc1 Open MPI v4.0.1 based

Posted on August 12, 2019 by Aurelien Bouteiller

The ICL ULFM team is happy to announce ULFM v4.0.1ulfm2.1rc1, a new implementation of the MPI fault-handling extension, in sync with the current Open MPI release (v4.0.1). Many new features have been added to both Open MPI and ULFM; this announcement focuses on the ULFM ones. For information about what is new in Open MPI 4.0.1, read the changelog.

This is a stability and upstream parity upgrade, moving ULFM from an old unreleased version of Open MPI to the latest stable (v4.0.1, May 2019 #b780667). It improves stability and performance, and makes it easier to track future stable releases of Open MPI.

Features

This implementation conforms to the User Level Failure Mitigation (ULFM) MPI Standard draft proposal. The ULFM proposal is developed by the MPI Forum’s Fault Tolerance Working Group to support the continued operation of MPI programs after crashes (node failures) have impacted the execution. The key principle is that no MPI call (point-to-point, collective, RMA, IO, …) can block indefinitely after a failure, but must either succeed or raise an MPI error.

This implementation produces the three supplementary error codes and five supplementary interfaces defined in the communicator section of the

Continue reading ULFM 2.1rc1 Open MPI v4.0.1 based→

ULFM 2.1a1 Docker Package

Posted on November 8, 2018 by Aurelien Bouteiller

There are many ways to install ULFM. For performance evaluation, large-scale experiments and platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small non-performance-critical test, you may prefer to spend your time working on the concepts instead of installing. Thus, we provide a Docker image for those who want to quickly test its capabilities. The 2.1a1 Docker image is a bugfix release compared to the previous 2.0rc version.

Using the Docker Image

Install Docker

Docker can be seen as a “lightweight” virtual machine. Docker is available for a wide range of systems (MacOS, Windows, Linux). You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. yum install docker, apt-get install docker.io, port install docker-io, etc.).

In a terminal, run docker run hello-world to verify that the Docker installation works.

Load the pre-compiled ULFM Docker machine into your Docker installation:

    docker pull abouteiller/mpi-ft-ulfm

Source the Docker aliases in a terminal; this redirects the “make” and “mpirun” commands in the local shell to execute in the Docker machine.

    alias make='docker run -v $PWD:/sandbox:Z abouteiller/mpi-ft-ulfm make'
    alias mpirun='docker run -v

Continue reading ULFM 2.1a1 Docker Package→

ULFM 2.0rc Docker package

Posted on November 10, 2017 by Aurelien Bouteiller

There are many ways to install ULFM. For large-scale experiments or large platforms, you should follow the instructions from the ULFM 2.0 repository. However, for a quick test, or for a small non-performance-critical test, you may prefer to spend your time working on the concepts instead of installing. Thus, we provide a Docker image for those who want to quickly test its capabilities.

Using the Docker Image

Install Docker

Docker can be seen as a “lightweight” virtual machine. Docker is available for a wide range of systems (MacOS, Windows, Linux). You can install Docker quickly, either by downloading one of the official builds for MacOS or Windows, or by installing Docker from your Linux package manager (e.g. `yum install docker`, `apt-get install docker.io`, `port install docker-io`, etc.).

In a terminal, run `docker run hello-world` to verify that the Docker installation works.

Load the pre-compiled ULFM Docker machine into your Docker installation:

    docker pull abouteiller/mpi-ft-ulfm

Source the Docker aliases in a terminal; this redirects the “make” and “mpirun” commands in the local shell to execute in the Docker machine.

    alias make='docker run -v $PWD:$PWD:V -w $PWD abouteiller/mpi-ft-ulfm make'
    alias mpirun='docker run -v $(pwd):$(pwd):V -w $(pwd) abouteiller/mpi-ft-ulfm mpirun --oversubscribe -mca btl tcp,self'

Run

Continue reading ULFM 2.0rc Docker package→

ULFM 2.0

Posted on November 3, 2017 by George Bosilca

ULFM 2.0 release Continue reading ULFM 2.0→

Try the Docker packaged ULFM fault tolerant MPI

Posted on February 24, 2017 by Aurelien Bouteiller

To support the SC’16 Tutorial, we have designed a self-contained Docker image. This packaged Docker image contains everything you need to compile and run the tutorial examples in a contained sandbox. Docker can be seen as a lightweight virtual machine, running its own copy of an operating system, but without the heavy requirement of a full-blown hypervisor. We use this technology to package a very small Linux distribution containing gcc, mpicc, and mpirun, as needed to compile and run your fault tolerant MPI examples natively on your host Linux, Mac or Windows desktop, without the effort of compiling a production version of ULFM Open MPI on your own.

Content:

1. A Docker Image with a precompiled version of ULFM Open MPI 1.1.
2. The tutorial hands-on example.
3. Various tests and benchmarks for resilient operations.
4. The sources for the ULFM Open MPI branch release 1.1.

Using the Docker Image

1. Install Docker

You can install Docker quickly, either by downloading one of the official builds from http://docker.io for MacOS and Windows, or by installing Docker from your Linux or MacOS package manager (e.g. yum install docker, apt-get install docker.io, brew/port install docker-io). Please refer to the Docker installation

Continue reading Try the Docker packaged ULFM fault tolerant MPI→

ULFM-1.1 Release

Posted on November 14, 2015 by George Bosilca

ULFM has reached the 1.1 milestone, a minor release squashing a few bugs identified by our users and developers. The focus has been on improving stability, feature coverage for intercommunicators, and following the updated specification for MPI_ERR_PROC_FAILED_PENDING.

- Addition of the MPI_ERR_PROC_FAILED_PENDING error code, as per the newer specification revision. It is properly returned from point-to-point, non-blocking ANY_SOURCE operations.
- Alias MPI_ERR_PROC_FAILED, MPI_ERR_PROC_FAILED_PENDING and MPI_ERR_REVOKED to the corresponding standard-blessed extension names MPIX_ERR_xxx.
- Support for intercommunicators:
  - Support for the blocking version of the agreement, MPI_COMM_AGREE, on intercommunicators.
  - MPI_COMM_REVOKE tested on intercommunicators.
- Completely disabled (.ompi_ignore) many untested components.
- Changed the default ORTE failure notification propagation aggregation delay from 1s to 25ms.
- Added an OMPI internal failure propagator; failure propagation between SM domains is now immediate.
- Bugfixes:
  - SendRecv would not always report MPI_ERR_PROC_FAILED correctly.
  - SendRecv could incorrectly update the status with errors pertaining to the Send portion of the Sendrecv.
  - Revoked send operations are now always completed or remotely cancelled and can no longer deadlock.
  - Cancelled send operations to a dead peer will not trigger an assert when the BTL reports that same failure.
  - Repeated calls to operations returning MPI_ERR_PROC_FAILED will eventually return MPI_ERR_REVOKED when another process revokes the communicator.

Get the source and happy hacking,

Continue reading ULFM-1.1 Release→

ULFM 1.0 Announced

Posted on August 27, 2015 by herault

The major 1.0 milestone has been reached for the User Level Failure Mitigation compliant fault tolerant MPI. We have focused on improving performance, both before and after the occurrence of failures. The list of new features includes:

- Support for the non-blocking version of the agreement, MPI_COMM_IAGREE.
- Compliance with the latest ULFM specification draft. In particular, the MPI_COMM_(I)AGREE semantics have changed.
- New algorithm to perform agreements, with a truly logarithmic complexity in the number of ranks, which translates into huge performance boosts in MPI_COMM_(I)AGREE and MPI_COMM_SHRINK. Meet us at SC’15 to learn more about the novel algorithm we designed!
- New algorithm to perform communicator revocation. MPI_COMM_REVOKE performs a reliable broadcast with a fixed maximum output degree, which scales logarithmically with the number of ranks. Meet us at EuroMPI’15 to learn more about the Revoke algorithm we designed!
- Improved support for our traditional network layers:
  - TCP: fully tested
  - SM: fully tested (with the exception of XPMEM, which remains unsupported)
- Added support for high-performance networks:
  - Open IB: reasonably tested
  - uGNI: reasonably tested
- The tuned collective module is now enabled by default (reasonably tested); expect a huge performance boost compared to the former basic default setting.
- Back-ported PBS/ALPS fixes from Open MPI

Continue reading ULFM 1.0 Announced→

ULFM Beta 3

Posted on October 10, 2012 by George Bosilca

The third beta for the ULFM implementation in Open MPI has been posted. The changelog is relatively small and can be found in the Release Notes section of ULFM. The tarballs can be found in the Downloads section and instructions for use can be found in the Usage Guide.