Checkpointing for Unix?
Bill Reynolds
breynolds at UCSD.EDU
Sat Apr 27 05:11:11 AEST 1991
Submitted-by: breynolds at UCSD.EDU (Bill Reynolds)
I originally posted this to comp.unix.questions. It was then
recommended to me that I post here as well.
>Greetings,
> We are a computational physics group running a network of Sun
>and SGI workstations. We often have long running jobs on many of our
>machines. This leads to problems when a machine needs to be taken down
>that has a job in the third day of a five day run. What we would like
>is a routine to checkpoint a job to a disk file for later reloading
>into memory. I've looked at undump, but it isn't adequate: we need to
>restart the job where it was interrupted. I've also looked at Condor,
>but it seems to be a fly-with-a-sledgehammer type of solution. I'm
>wondering if there are any simple Unix/Sun/SGI utilities to do
>checkpointing. (I know that such facilities exist for Crays.)
I would also like to add that such a facility would have to support
Fortran and would have to be simple enough that someone with only a
background in scientific computing could use it (i.e., no system
calls, no calls to C routines from Fortran, etc.). It has also been
suggested that I modify the undump code. I find this a daunting
task (any takers?). (By the way, I have not actually gotten undump
working on either the Sun or the SGI.)
--
_______________________________________________________________________
| Bill Reynolds
| bill at inls1.ucsd.edu
[ First of all, there is Dan Bernstein's Poor Man's Checkpointing Package,
posted to alt.sources (I think) a month or three ago. Also, one of
the POSIX subgroups specifies checkpointing, that being the main reason
I'm posting this. I will let others (who are likely to be more
knowledgeable about it) comment further, if they wish. -- mod ]
Volume-Number: Volume 23, Number 47