Checkpoints for large jobs

Skip Montanaro montnaro at spyder.crd.ge.com
Sat Aug 11 23:11:23 AEST 1990


On a case-by-case basis, you may be able to modify your applications so they
can resume after an interruption instead of starting over. For instance, if
your application is an iterative solver of some sort, you may be able to
checkpoint the intermediate data to disk periodically. When the program is
restarted, a flag can be set so that it initializes from the saved
intermediate solution rather than from the original starting values.
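
To make the idea concrete, here is a rough sketch of that pattern. It is
only an illustration under my own assumptions, not code from any particular
package: the solver is a toy Newton iteration for a square root, and the
names save_checkpoint, load_checkpoint, solve, the restart flag, and the
stop_after parameter (which just simulates the job being killed) are all
made up for the example. The state is written to a temporary file and then
renamed into place, so a crash during the write should not leave a
half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the checkpoint path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # the old checkpoint stays valid until this point


def load_checkpoint(path):
    """Read back the state written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


def solve(a, path, restart=False, every=5, tol=1e-12, max_iter=100,
          stop_after=None):
    """Toy iterative solver: Newton's method for sqrt(a), with a > 0.

    restart=True is the "flag" from the text: initialize from the
    checkpoint file if one exists. stop_after is hypothetical and only
    simulates an interruption after that many steps in this run.
    """
    if restart and os.path.exists(path):
        state = load_checkpoint(path)
    else:
        state = {"iteration": 0, "x": float(a)}

    steps_this_run = 0
    while state["iteration"] < max_iter:
        x = state["x"]
        new_x = 0.5 * (x + a / x)
        state = {"iteration": state["iteration"] + 1, "x": new_x}
        steps_this_run += 1

        # Checkpoint the intermediate solution every `every` iterations.
        if state["iteration"] % every == 0:
            save_checkpoint(path, state)

        if abs(new_x - x) < tol:
            break
        if stop_after is not None and steps_this_run >= stop_after:
            break  # pretend the machine went down here
    return state
```

You would call solve(2.0, "job.ckpt") for a fresh run and
solve(2.0, "job.ckpt", restart=True) after a crash. How often to checkpoint
is a trade-off: writing more often costs more I/O, writing less often means
more work is repeated after a restart.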

There was a system developed a few years ago (maybe 1986?) at the University
of Wisconsin that allowed jobs to be restarted (modulo some special I/O
situations). It was reported at a USENIX conference of that era.

Also, UNICOS on the CRAY has a checkpointing facility. You might investigate
it, and ask Sequent why they haven't got something similar.


--
Skip (montanaro at crdgw1.ge.com)


