Checkpoints for large jobs

Sat Aug 11 23:11:23 AEST 1990

On a case-by-case basis, you may be able to modify your applications so they
can recover from an interruption. For instance, if your application is an
iterative solver of some sort, it can checkpoint its intermediate data
periodically. When the program is restarted, a flag can be set so the program
initializes from the saved intermediate solution instead of starting over.
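
To make the idea concrete, here is a rough sketch in Python of what I mean.
It is only an illustration under my own assumptions: the file name
solver.ckpt, the function names, the "restart" flag, and the use of a Newton
iteration for a square root as a stand-in for a real solver are all made up
for the example. A real application would save whatever state it needs to
resume.

```python
import json
import os
import tempfile

CHECKPOINT = "solver.ckpt"  # hypothetical file name, chosen for this sketch


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename means a crash during the write leaves the previous
    checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Read back the state written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


def solve(a, restart=False, path=CHECKPOINT, every=5, tol=1e-12, max_iter=1000):
    """Newton iteration for sqrt(a), a > 0, standing in for a long solver.

    If restart is set and a checkpoint exists, resume from it; otherwise
    start from the initial guess.
    """
    if restart and os.path.exists(path):
        state = load_checkpoint(path)
    else:
        state = {"iteration": 0, "x": float(a)}

    while state["iteration"] < max_iter:
        x = state["x"]
        new_x = 0.5 * (x + a / x)
        state = {"iteration": state["iteration"] + 1, "x": new_x}
        # Save the intermediate solution every few iterations.
        if state["iteration"] % every == 0:
            save_checkpoint(path, state)
        if abs(new_x - x) < tol:
            break
    return state
```

A first run would call solve(2.0); after a crash, calling
solve(2.0, restart=True) picks up from the last saved iteration. How often to
checkpoint is a trade-off between the I/O cost of each save and the amount of
work you are willing to redo.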

A few years ago (maybe 1986?) a system was developed at the University of
Wisconsin that allowed jobs to be restarted (modulo some special I/O
situations). It was reported at a USENIX conference of that era.

Also, UNICOS on the CRAY has a checkpointing facility. You might investigate
it, and ask Sequent why they haven't got something similar.

--
Skip (montanaro at crdgw1.ge.com)