Checkpoints for large jobs

Mike Litzkow mike at cream.cs.wisc.edu
Wed Aug 15 01:47:18 AEST 1990


Yes, checkpointing is one part of the Condor system (previously called RU).
Condor uses cycles on idle workstations by migrating processes to them.  When
a workstation's normal user starts using it again, the Condor jobs running
there are checkpointed and later moved to another idle workstation to continue
execution.

The checkpointing is accomplished by causing the process to dump core, then
combining parts of the core file with parts of the original executable.  The
software keeps track of which files have been opened and re-opens them when
the job restarts from a checkpoint.  This is accomplished by linking the user
program with special versions of "crt0.o" and "libc.a".
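
To make the file-tracking idea concrete, here is a minimal sketch of that
bookkeeping.  It is written in Python for brevity, not the C that the special
"libc.a" would use, and it is not Condor's code: the FileTable class and its
method names are hypothetical, and it assumes a POSIX system.  It records the
path, flags, and current offset of each open file, then re-opens each file on
the same descriptor number and seeks back to the saved offset.

```python
import os
import tempfile


class FileTable:
    """Hypothetical bookkeeping for open files (illustration only)."""

    def __init__(self):
        self.entries = {}  # fd -> (absolute path, open flags)

    def open(self, path, flags, mode=0o644):
        fd = os.open(path, flags, mode)
        self.entries[fd] = (os.path.abspath(path), flags)
        return fd

    def close(self, fd):
        os.close(fd)
        del self.entries[fd]

    def checkpoint(self):
        """Return (fd, path, flags, offset) for every tracked file."""
        return [
            (fd, path, flags, os.lseek(fd, 0, os.SEEK_CUR))
            for fd, (path, flags) in self.entries.items()
        ]

    def restore(self, saved):
        """Re-open each saved file on its old fd and seek to its offset."""
        self.entries = {}
        for old_fd, path, flags, offset in saved:
            # Do not create, truncate, or fail on an existing file this time.
            reopen_flags = flags & ~(os.O_CREAT | os.O_TRUNC | os.O_EXCL)
            new_fd = os.open(path, reopen_flags)
            if new_fd != old_fd:
                # Keep the descriptor number the program already holds.
                os.dup2(new_fd, old_fd)
                os.close(new_fd)
            os.lseek(old_fd, offset, os.SEEK_SET)
            self.entries[old_fd] = (path, flags)


def demo():
    """Read part of a file, 'checkpoint', lose the fd, restore, read on."""
    table = FileTable()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        with open(path, "w") as f:
            f.write("hello checkpoint")
        fd = table.open(path, os.O_RDONLY)
        first = os.read(fd, 6)
        saved = table.checkpoint()
        os.close(fd)            # stand-in for the process going away
        table.restore(saved)
        rest = os.read(fd, 100)
        table.close(fd)
    return first, rest
```

Calling demo() reads the start of the file before the simulated checkpoint and
the remainder after the restore, using the same descriptor number throughout.
A real implementation would also have to deal with files that were deleted or
changed in the meantime, which this sketch ignores.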

Condor is available without charge by anonymous ftp from "shorty.cs.wisc.edu"
(128.105.2.8).  Just log in as "ftp" and give your user name as the password.
Then "cd" to the condor directory and take a look at the Readme file.  It will
instruct you to fetch a compressed binary file; remember to set your ftp to
"binary" mode before transferring it.

The checkpointing is set up so that you can use it without process migration
or remote execution if you wish.  It compiles and runs on a Sequent Symmetry.

-- mike


