A comprehensive user-level checkpointing strategy for MPI applications

Abstract

As computational clusters grow, their mean time to failure drops drastically. After a failure, most MPI checkpointing solutions require a restart on the same number of nodes. This requires keeping multiple spare nodes available, leading to poor resource utilization. Moreover, most techniques rely on central storage for checkpoints, which becomes a bottleneck and severely limits the scalability of checkpointing.
We propose a scalable fault-tolerant MPI, based on LAM/MPI, that supports user-level checkpointing, migration, and replication. Our contributions extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate both centralized storage and SAN-based solutions and show that they do not scale, particularly beyond 64 CPUs. Our migration strategy is the first to make no assumptions about restart topologies, eliminating the need for spare nodes. We demonstrate the low overhead of our checkpointing and replication scheme using the NAS Parallel Benchmarks and the High Performance LINPACK benchmark in tests on up to 256 nodes. We show that checkpointing and replication can be achieved with much lower overhead than current techniques and with near transparency to the end user, while still providing fault resilience.

Date: 2007
Authors: J Walters, Vipin Chaudhary
Journal: Technical report, TR 2007-1
Publisher: The State University of New York
