Section: New Results
Efficient checkpoint/verification patterns
Participants: Anne Benoit, Saurabh K. Raina [Jaypee Institute of Information Technology], Yves Robert.
Errors have become a critical problem for high-performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that they are detected only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism that checks whether the application state is correct, so checkpoints should be supplemented with verifications. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct.
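To make the interplay between verifications and checkpoints concrete, here is a minimal Python sketch of the simplest arrangement, in which every chunk of work is followed by one verification and, if that verification succeeds, by one checkpoint. It is an illustration under stated assumptions, not the balanced algorithm studied in this work: the function name, the cost parameters (work, verify_cost, checkpoint_cost, recovery_cost) and the corrupts callback that injects silent errors are all hypothetical.

```python
from typing import Callable


def simulate_pattern(
    n_chunks: int,
    work: float,
    verify_cost: float,
    checkpoint_cost: float,
    recovery_cost: float,
    corrupts: Callable[[int], bool],
) -> float:
    """Return the total time needed to complete n_chunks chunks of work.

    Assumed model (hypothetical, for illustration only):
    each chunk is computed, then verified, then checkpointed if the
    verification succeeds. corrupts(attempt) tells whether the
    attempt-th chunk execution (0-based, re-executions included)
    is hit by a silent error.
    """
    elapsed = 0.0
    done = 0
    attempt = 0
    while done < n_chunks:
        corrupted = corrupts(attempt)
        attempt += 1
        # A silent error is only noticed at the verification, so the
        # full chunk and the verification are always paid for.
        elapsed += work + verify_cost
        if corrupted:
            # Roll back to the last checkpoint, which is known to be
            # correct because it was taken right after a verification.
            elapsed += recovery_cost
            continue
        # State verified: checkpoint it. Older checkpoints are no
        # longer needed once this one is stored.
        elapsed += checkpoint_cost
        done += 1
    return elapsed
```

With work = 10, verify_cost = 1, checkpoint_cost = 2 and recovery_cost = 3, three error-free chunks take 39 time units in this model, while a single silent error during the second execution raises the total to 53: the corrupted chunk is computed and verified in vain, then recovered and re-executed.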
In this work, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with
This work has been published in the International Journal of High Performance Computing Applications [8].