Publication: Implementing checkpointing and recovery algorithm for fault tolerant computation
Date
2012-06-01
Authors
Liew, Siew Wan
Abstract
Checkpoint-recovery is commonly used to implement fault tolerance in multicomputer systems. Checkpoints can be used in conjunction with exception-handling abstractions to recover from exceptional or erroneous events, to support debugging or replay mechanisms, or to facilitate algorithms that rely on speculative evaluation. During failure-free operation, process states are saved at regular intervals; after a fault is detected, the system is restored to a previously saved state. Checkpointing is typically used to minimize the execution time of long-running programs in the presence of failures, and an optimal checkpointing approach may be determined to reduce the expected execution time. This project studies several techniques, including coordinated, uncoordinated, and communication-induced checkpointing, and compares their overall performance. In addition to a static checkpointing scheme, several random checkpointing approaches have been implemented as a dynamic checkpointing scheme in this thesis. Qualitative and experimental analyses are applied to test the effectiveness and reliability of the proposed methods. The proposed random checkpointing appears to be sufficiently effective compared with the checkpointing produced by traditional methods. These studies suggest that the proposed checkpointing method has great potential for fault tolerance in large software applications.
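The save-and-rollback cycle described in the abstract can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the class `CheckpointedCounter`, the function `run`, and the parameters `interval` and `fail_at` are hypothetical names chosen for illustration, faults are injected deterministically, and a fixed (static) checkpoint interval is assumed.

```python
import copy


class CheckpointedCounter:
    """Toy process whose state is a step counter and a running total.

    Hypothetical illustration only; not taken from the thesis.
    """

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # The initial state acts as the first checkpoint.
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a full copy of the current state.
        self._saved = copy.deepcopy(self.state)

    def recover(self):
        # Roll back to the most recently saved state.
        self.state = copy.deepcopy(self._saved)

    def work(self):
        # One unit of computation: advance the step and accumulate it.
        self.state["step"] += 1
        self.state["total"] += self.state["step"]


def run(n_steps, interval, fail_at):
    """Complete n_steps units of work with a static checkpoint interval.

    fail_at is a set of work-attempt numbers (1-based) after which a
    fault is injected. Returns (final total, work units executed).
    """
    proc = CheckpointedCounter()
    pending = set(fail_at)
    executed = 0
    while proc.state["step"] < n_steps:
        proc.work()
        executed += 1
        if executed in pending:
            # Fault detected: discard work done since the last checkpoint.
            pending.discard(executed)
            proc.recover()
            continue
        if proc.state["step"] % interval == 0:
            proc.checkpoint()
    return proc.state["total"], executed
```

Calling `run(10, 3, {5})` injects one fault after the fifth work attempt; the process rolls back to the checkpoint taken at step 3 and repeats the lost steps, so the final result matches the failure-free run at the cost of extra executed work. A shorter `interval` reduces the work lost per fault but takes more checkpoints, which is the trade-off an optimal checkpointing approach tries to balance.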