Implementing checkpointing and recovery algorithm for fault tolerant computation

Liew, Siew Wan

Date

2012-06-01

Authors

Liew, Siew Wan

Abstract

Checkpoint-recovery is commonly used to implement fault tolerance in multicomputer systems. Checkpoints can be used in conjunction with exception-handling abstractions to recover from exceptional or erroneous events, to support debugging or replay mechanisms, or to facilitate algorithms that rely on speculative evaluation. During failure-free operation, process states are saved at regular intervals; after a fault is detected, the system is restored to a previously saved state. Checkpointing is typically used to minimize the execution time of long-running programs in the presence of failures, and an optimal checkpointing approach may be determined to reduce the expected execution time. This project studies several techniques, including coordinated, uncoordinated and communication-induced checkpointing, and compares their overall performance. In addition to a static checkpointing scheme, several random checkpointing approaches have been implemented as dynamic checkpointing schemes in this thesis. Qualitative and experimental analyses are applied to test the effectiveness and reliability of the proposed methods. The proposed random checkpointing appears to be sufficiently effective compared with the checkpointing produced by traditional methods. These studies suggest that the proposed checkpointing method has great potential for fault tolerance in large software applications.
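The save-and-roll-back cycle described in the abstract can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the function name `run_with_checkpoints`, the fixed `interval`, and the `fail_at` fault-injection parameter are all hypothetical, and the "computation" is just a running sum. It only shows how the distance to the last checkpoint determines the amount of work repeated after a fault.

```python
import pickle


def run_with_checkpoints(steps, interval, fail_at=()):
    """Sum the integers 0..steps-1, saving state every `interval` steps.

    `fail_at` is a hypothetical set of step indices at which a fault is
    injected (each fault fires once). On a fault, the state is rolled
    back to the last checkpoint and the lost steps are re-executed.
    Returns (result, total_steps_executed).
    """
    state = {"i": 0, "total": 0}
    checkpoint = pickle.dumps(state)           # initial checkpoint
    pending_faults = set(fail_at)
    executed = 0

    while state["i"] < steps:
        if state["i"] in pending_faults:
            pending_faults.discard(state["i"])  # fault fires only once
            state = pickle.loads(checkpoint)    # recovery: roll back
            continue
        state["total"] += state["i"]            # one unit of useful work
        state["i"] += 1
        executed += 1
        if state["i"] % interval == 0:
            checkpoint = pickle.dumps(state)    # save the process state

    return state["total"], executed


if __name__ == "__main__":
    # Same fault, different checkpoint intervals: shorter intervals
    # lose less work on rollback (checkpoint cost itself is not modeled).
    for interval in (2, 4, 10):
        print(interval, run_with_checkpoints(10, interval, fail_at={6}))
```

The sketch ignores the cost of taking a checkpoint, which is the other side of the trade-off an optimal checkpointing interval has to balance, and it does not address the coordination between processes that coordinated, uncoordinated and communication-induced schemes deal with.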

URI

https://erepo.usm.my/handle/123456789/20595

Collections

Pusat Pengajian Kejuruteraan Elektrik dan Elektronik - Monograf
