CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
Abstract:
Checkpointing is a useful technique for rollback recovery of parallel
applications. While extensive research has been performed on
checkpointing in parallel environments,
there are few checkpointers available to application users on
commercial parallel computers. This paper presents one such
checkpointer: CLIP. CLIP is a user-level library that provides
semi-transparent checkpointing for parallel programs on the Intel
Paragon multicomputer. It is publicly available to Paragon users
at no cost.

Conceptually, checkpointing a multicomputer is straightforward.
However, when creating an actual tool for checkpointing a complex
machine like the Paragon, many more issues arise that require careful
design decisions to be made. Sometimes ease-of-use must be
sacrificed for efficiency and/or correctness. This paper details
what these decisions are, and how they were made in CLIP.

We also present performance data from checkpointing several
long-running Paragon applications with CLIP. The bottom line is that
a convenient, general-purpose checkpointing tool like CLIP can
provide fault-tolerance on a massively parallel multicomputer like
the Paragon with very good performance.