SC’23 Tutorial

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’23. This year, the tutorial will be held in a single, full-day session on Sunday, November 12th. Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts. The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction (1, 2) covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions (3, 4) introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during …
Continue reading SC’23 Tutorial

SC’22 Tutorial

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’22. This year, the tutorial will be held in a single, full-day session on Monday, November 14th. Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts. The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction (1, 2) covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions (3, 4) introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during …
Continue reading SC’22 Tutorial

SC’21 Tutorial

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’21. This year, the tutorial will be held in a single, full-day session on Sunday, November 14th. Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts.

slides, slides, slides: Introduction and Methods for Fault Tolerance
slides, slides, slides, slides: Performance Models
slides: Handling Errors in MPI Applications
slides: Recovering MPI Application Examples
docker: ULFM Fault Tolerant MPI on your desktop

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction (1, 2) covers different existing resilience approaches, including modeling solutions such as C/R, …
Continue reading SC’21 Tutorial

SC’20 Tutorial

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’20. This year, the tutorial will be split into two sessions over two days. Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts.

Monday the 9th material
slides: Introduction and Methods for Fault Tolerance
slides: Failures and Silent Errors
slides: Checkpointing: the Young/Daly formula
slides: In-memory & Hierarchical Checkpointing, Replication
slides: Silent Errors

Tuesday the 10th material
slides: Handling Errors in MPI Applications
slides: Recovering MPI Application Examples
docker: ULFM Fault Tolerant MPI on your desktop

The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the …
Continue reading SC’20 Tutorial

SC’19 Tutorial

Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts.

Theoretical Session slides
Practical Session slides
Examples
ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’19. The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM docker. More …
Continue reading SC’19 Tutorial

SC’18 Tutorial

Before attending the tutorial, please download the following material; you will need it to follow the theoretical and practical parts.

Theoretical Session slides
Practical Session slides
Examples
ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’18 (somewhat similar to past incarnations). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we …
Continue reading SC’18 Tutorial

EuroMPI’18 tutorial

Following the success of the first joint tutorial with the VeloC team, we decided to follow up with a second incarnation of this mixed tutorial at EuroMPI’18. Bogdan Nicolae, Franck Cappello and George Bosilca will present this tutorial, titled Resilience in parallel applications. The tutorial will cover two complementary fault management techniques that empower application developers to deal with various types of failures directly at the application level, increasing the opportunities to reduce the resilience overhead with holistic support from all layers: hardware, software, and the parallel programming paradigm. The tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: application-defined checkpoint-restart (as demonstrated through the VeloC runtime) and user-level failure mitigation (as demonstrated through the ULFM extension to the MPI standard). The tutorial will use the following decks of slides: Introduction, VeloC and ULFM, as well as a set of examples for VeloC and ULFM. For the hands-on session, participants are expected to bring their own laptop, running Windows, Linux or Mac OS X, with Docker installed.

Using the Docker Image

Install Docker

Docker can be seen as a “lightweight” virtual …
Continue reading EuroMPI’18 tutorial

EuroPar’18 tutorial

The ULFM team is happy to announce that a joint tutorial on resilience with the VeloC team has been accepted at EuroPar’18. Bogdan Nicolae, Franck Cappello and George Bosilca will present this tutorial, titled Application-driven Fault-Tolerance for High Performance Distributed Computing. The tutorial will focus on a few approaches that empower application developers to deal with different types of failures at the application level, increasing the opportunities to reduce the resilience overhead with holistic support from all layers: hardware, software, and the parallel programming paradigm. The tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: application-defined checkpoint-restart (as demonstrated through the VeloC runtime) and user-level failure mitigation (as demonstrated through the ULFM extension to the MPI standard). The tutorial will use the following decks of slides: Introduction, VeloC and ULFM, as well as a set of examples for VeloC and ULFM. For the hands-on session, participants are expected to bring their own laptop, running Windows, Linux or Mac OS X, with Docker installed.

Using the Docker Image

Install Docker

Docker can be seen as a “lightweight” virtual machine, a perfect …
Continue reading EuroPar’18 tutorial

SC’17 Tutorial

When attending the tutorial, please download the material used during the tutorial from the following links:

Theoretical Session slides
Practical Session slides
Examples
ULFM docker

The ULFM team is happy to announce that our day-long tutorial on fault tolerance has been accepted at SC’17 (somewhat similar to last year's tutorial). The tutorial will cover multiple theoretical and practical aspects of predicting, detecting, and finally dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures and their potential impact on applications, up to practitioners with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one addressing the theoretical aspects and one focused on the ULFM extension of the MPI programming model. The theoretical introduction covers different existing resilience approaches, including modeling solutions such as C/R, buddy checkpointing, and Algorithm-Based Fault Tolerance (ABFT). The practical sessions introduce a more detailed description of the ULFM extensions and a set of hands-on examples. To facilitate public interaction with ULFM, during but also outside the tutorial, we have created a ULFM docker.
Continue reading SC’17 Tutorial

SC’16 Tutorial

The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year's tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges of different types of failures, up to advanced users with prior experience with fault-related topics who want a more precise understanding of the available tools for dealing with faults efficiently. The tutorial is divided into two parts, one theoretical (covering the different existing approaches and their modeling) and one practical. The slides for the two parts are available (theory and practice), as well as the hands-on examples. Unlike in previous years, we have embraced new technologies to facilitate public interaction with ULFM: enjoy the ULFM docker. More information about the tutorial can be found here. Enjoy our promotional video 😉 See you all in Salt Lake City, UT!