Spurious Errors: lack of MPI Progress and Failure Detection
A very common mishap while developing parallel applications is to assume an application running at small scales automatically translates into successful large scale runs. Such optimistic views are largely unproven in general, but tools exists to help with the validation process. However, adding resilience to a parallel application has a tendency to increase the likelihood of consistency errors, and unfortunately no tools to help through this process currently exist. In some cases, few common sense practices could save hours of debugging, and improve the quality of the parallel application. As you work on you MPI fault tolerant application, you discover it runs fine for small scales and small data sets, but when increasing the number of processes or the computational load on the participating processes, spurious faults seems to be ‘injected’ for no good reason. It might be easy to blame it on the underlying libraries, but before we go there it is possible you are observing an inter-operability issue between the different layers of the resilient software stack. More precisely, you may be observing the effect of the lack of MPI progress on the failure detector within the MPI library. MPI Progress (and lack thereof) The MPI Continue reading Spurious Errors: lack of MPI Progress and Failure Detection