This repository contains documentation and implementations for three major projects:
CRAC (Checkpoint-Restart Architecture for CUDA) provides checkpoint/restart functionality for CUDA-based HPC applications.
DMTCP (Distributed MultiThreaded Checkpointing) provides transparent checkpoint/restart functionality for distributed applications.
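MANA (MPI-Agnostic, Network-Agnostic Transparent Checkpointing) provides transparent checkpoint/restart functionality for MPI applications, independent of the MPI implementation and network fabric.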
Checkpointing in MANA proceeds in multiple coordinated phases, so that the saved application state is consistent and the process stays transparent to the application.
The restart process reconstructs the entire application state from the checkpoint files, and it can do so under a different MPI implementation or network fabric than the one in use when the checkpoint was taken.
Every MPI call goes through MANA's interception layer to provide transparency and enable checkpointing.
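The interception idea can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and is not MANA's actual code. `FakeMpiLibrary`, `InterceptionLayer`, and `call_log` are hypothetical names introduced only for this illustration. The point is that the application only ever talks to the wrapper, so the wrapper sees every call before the real library does.

```python
class FakeMpiLibrary:
    """Stand-in for a real MPI library (hypothetical, not an MPI binding)."""

    def send(self, payload, dest):
        return f"sent {payload!r} to rank {dest}"


class InterceptionLayer:
    """Hypothetical wrapper layer: every call is routed through a wrapper."""

    def __init__(self, real_library):
        self._real = real_library
        self.call_log = []  # names of intercepted calls, in call order

    def __getattr__(self, name):
        # Only reached for names not defined on the wrapper itself,
        # i.e. the functions that belong to the wrapped library.
        real_function = getattr(self._real, name)

        def wrapper(*args, **kwargs):
            # Bookkeeping happens here, before the real call is made.
            self.call_log.append(name)
            return real_function(*args, **kwargs)

        return wrapper


mpi = InterceptionLayer(FakeMpiLibrary())
result = mpi.send("hello", dest=1)
```

Because the bookkeeping lives in one place, a checkpointing layer built this way could record or delay calls without any change to application code.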
Virtual objects enable MANA's MPI/network agnosticism by abstracting implementation details.
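A rough sketch of the virtual-object idea, again in Python and under stated assumptions: the application holds only stable virtual handles, and a table maps each one to whatever real handle the current MPI library issued. `VirtualHandleTable`, `register`, `to_real`, and `rebind` are hypothetical names, and the string handles are placeholders. None of this is MANA's actual data structure.

```python
import itertools


class VirtualHandleTable:
    """Hypothetical table mapping stable virtual handles to real handles."""

    def __init__(self):
        self._next_virtual = itertools.count(1)
        self._virtual_to_real = {}

    def register(self, real_handle):
        # Give the application a virtual handle instead of the real one.
        virtual = next(self._next_virtual)
        self._virtual_to_real[virtual] = real_handle
        return virtual

    def to_real(self, virtual):
        # Translate just before calling into the real library.
        return self._virtual_to_real[virtual]

    def rebind(self, virtual, new_real_handle):
        # After a restart the real handle may differ; the virtual one does not.
        if virtual not in self._virtual_to_real:
            raise KeyError(virtual)
        self._virtual_to_real[virtual] = new_real_handle


table = VirtualHandleTable()
comm = table.register("library-A-handle")  # placeholder real handle
table.rebind(comm, "library-B-handle")     # e.g. after restarting elsewhere
```

The application keeps using `comm` unchanged across the rebind, which is what lets the underlying implementation be swapped without the application noticing.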



