Checkpoint clarification
David Gundlach
david at rolf.stat.uga.edu
Wed Aug 22 12:04:03 AEST 1990
Hello all,
I am a co-worker of the guy that got you started on all of this,
and I would like to clarify (maybe) some stuff.
We are the System Support for the Statistics department at the
University of Georgia. We have many professors who know some
Fortran77 and wish to do number crunching. They have worked with
computers for quite a while, but not from a programmer's angle--
they simply write the program that will crunch their arrays (or
whatever).
I am an undergraduate CS major, and Mark has just entered grad
school here (he got his CS BS). I don't know much at all about
the kernel and job control (barely enough to be dangerous), and
Mark can only put in a few hours a week. Thus, neither of us can
take an application and port it to C and write in the checkpointing
functions. Unfortunately, not every 'developer of long-running
applications' (I think Dr. Hutcheson would like that title :-) can
really be computer literate.
This is why we need something already written.
We got quite a few answers, and Mark's thank you is below. Our one
comfort is that these applications, by their very nature, exist almost
entirely in RAM throughout their execution. With that, we may be
able to proceed.
David Gundlach           david at rolf.stat.uga.edu
UGA Statistics           gundlach at csun2.cs.uga.edu
University of Georgia    404/542-3289 or 404/542-5232
"I'm a reasonably good speller, but a lousy typist." - me
begin included thank you
------------------------
I got a dizzying array of responses to my question that can basically
be summarized as:
1) it can't be done
2) write your own for each job
3) use Condor
Of course everyone is right... amazing how slippery is the TRUTH.
Below please find MY (in other words, what I think they said)
synthesis of the e-mail I got:
There are a number of SPECIAL problems like network connections,
unofficial serial devices, etc. that SHOULD never be handled
by the computer (hand-waving works with people, not computers).
By and large, however, the run-of-the-mill number cruncher with
input file(s) and output file(s) can be put in suspended
animation and later awakened without incurring the Wrath of
Khan--er, Kernel.
Condor is the slick way of doing this:
1) it's been written
2) it was written well
3) it works somewhere already (shorty.wisc.cs.edu)
Incidentally Condor is capable of spreading system load around and
other P9-ishness.
So there you have it. Thank you for taking some time out of your
busy schedule to reply to me.
mark
mth at rolf.stat.uga.edu
More information about the Comp.unix.wizards
mailing list