Tag Archives: event
SC’16 Tutorial
The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’16 (somewhat similar to last year’s tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges posed by different types of failures to advanced users with prior experience in fault-related topics who want a more precise understanding of the tools available for dealing with faults efficiently. The tutorial is divided into two parts, one theoretical (covering the different existing approaches and their modeling) and one practical. The slides for the two parts are available (theory and practice), as well as the hands-on examples. Unlike in previous years, we have embraced new technologies to make it easier for the public to interact with ULFM: enjoy the ULFM Docker. More information about the tutorial can be found here. Enjoy our promotional video 😉 See you all in Salt Lake City, UT!!!
"Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems" Presented at SC'15
SC’15 tutorial
The ULFM team is happy to announce that we will be teaching a day-long tutorial on fault tolerance at SC’15 (somewhat similar to last year’s tutorial). The tutorial will cover multiple theoretical and practical aspects of dealing with faults. It targets a wide scientific community, from scientists trying to understand the challenges posed by different types of failures to advanced users with prior experience in fault-related topics who want a more precise understanding of the tools available for dealing with faults efficiently. Get the slides (part1, part2) and the examples. More information about the tutorial can be found here. Enjoy our promotional video 😉 See you all in Austin, TX!!!
Tutorial @ SC'14: FAULT-TOLERANCE FOR HPC: THEORY AND PRACTICE
Preparing for June MPI Forum meeting
In preparation for the June MPI Forum meeting, the specification has received some updates. The most prominent changes are: The exposed memory in an RMA window may be completely undefined after a failure has occurred. MPI_Comm_agree now performs a binary AND on the flag argument. Examples have been corrected to use error classes, instead of error codes, when relevant. The latest version is available in the ULFM specification area.
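To make the MPI_Comm_agree change concrete, here is a minimal sketch of the AND semantics in plain Python. It is not MPI code and does not call any ULFM function: the helper `agree`, the two flag bits, and the sample contributions are all hypothetical, and it assumes that "binary AND" means a bitwise AND over the integer flags contributed by the participating processes. How failed processes are treated is outside the scope of this sketch.

```python
from functools import reduce
from operator import and_


def agree(flags):
    """Model the agreed value as the bitwise AND of every contributed flag.

    Hypothetical helper for illustration only; it is not part of MPI or ULFM.
    """
    if not flags:
        raise ValueError("at least one contribution is required")
    return reduce(and_, flags)


# Hypothetical encoding: each bit records one local condition.
CHECKPOINT_OK = 0b01  # "my checkpoint was written"
INPUT_OK = 0b10       # "my input was valid"

# Three simulated processes; the last one reports invalid input.
contributions = [
    CHECKPOINT_OK | INPUT_OK,
    CHECKPOINT_OK | INPUT_OK,
    CHECKPOINT_OK,
]

result = agree(contributions)

# A bit survives only if every contributor set it.
print(bool(result & CHECKPOINT_OK), bool(result & INPUT_OK))  # True False
```

With an AND, a single integer can carry several independent yes/no votes at once: a bit stays set in the result only when every contributor set it, so one dissenting process is enough to clear it for everyone.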
March MPI Forum Meeting
A full presentation of ULFM was given to the MPI Forum during the March meeting in San Jose, where we received good feedback. We also presented a deck of slides summarizing users’ success stories with ULFM.
Flyer for ULFM at SC’13
A flyer has been created for SC’13. It puts further emphasis on use cases and features updated graphs showcasing performance while failures are being digested by the system. Download the flyer.
ULFM at SC’12
The User Level Failure Mitigation team will be at this year’s Supercomputing conference in Salt Lake City. Come visit us at The University of Tennessee booth (#3010) to hear more about our work as well as lots of other interesting research going on at UTK.
ULFM Flyer
We have a flyer for User Level Failure Mitigation to show new results and design. It can be found at this link: SC12 ULFM Flyer.