killing a process gone bad.
Neil Rickert
rickert at mp.cs.niu.edu
Fri Nov 2 04:52:43 AEST 1990
In article <1119 at massey.ac.nz> GEustace at massey.ac.nz (Glen Eustace) writes:
>We recently had the exact situation described in the previous
>posting. There was a little more code involved but the net effect
>was the same. All attempts to clear out the system failed as there
>was no spare CPU available to allow remedial action to be taken. The
>problem was cured by a reboot.
>
>Following our problem, the perpetrator posted to comp.unix.questions
>to find out what we could have done. We received various replies
>including the 'kill -9 -1' variety.
>
We have 10 processors. Simple killing of replicating processes never
works, because more are created as fast as old ones are killed off. I
regularly see students who inadvertently create the problem, and finish
up running out of processes (the local per-user limit is 50).
I have NEVER had to reboot to resolve this problem. My experience is
with a BSD system, so may not apply to SysV.
Here are three simple approaches to try:
(1) The simple-minded approach.
Look for a file which the programs depend on. Try removing or
renaming that file. In particular, if the replicating process
seems to be a shell script, look for a shell script in the user's
directory named 'test'.
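As a rough illustration of approach (1), here is a minimal sketch. The helper name 'disarm' and the '.disabled' suffix are hypothetical, not anything from the original incident. It assumes you run it from an account that still has free process slots and has write permission on the directory.

```shell
# Hypothetical helper: take a runaway script out of circulation.
# Clearing the execute bits and renaming the file means that any
# further attempt to start the script by name fails.
disarm() {
    target=$1
    if [ ! -f "$target" ]; then
        echo "no such file: $target" >&2
        return 1
    fi
    chmod a-x "$target" && mv "$target" "$target.disabled"
}
```

For example, 'disarm /path/to/user/test' would leave 'test.disabled' behind, so the student can still inspect the script afterwards. Copies that are already running are not affected; they only fail when they next try to start the file.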
(2) The slow and tedious method.
This is a method I sometimes ask the student and/or his instructor
to use. It is somewhat slow, as it requires killing all the processes
individually. It usually works.
Step 1. Find a list of the bad processes. If the student is doing
this himself, he can ask a friend on a different account to do a
'ps uax|grep user' for this purpose. Failing that, he should be
able to log in, and then use 'exec ps ug'. This will give the list
of processes, but log him out again (the exec replaces his login
shell, so no new process is needed).
Step 2. Armed with a list of process IDs, start killing them with
the STOP signal.
    exec /bin/kill -STOP pid pid pid ...
The idea is to prevent further replication, but keep the processes
in place so that you are always at the limit. This step and Step 1
may have to be repeated several times to stop them all.
Step 3. Start killing the STOPPED processes. To do this you
will need the output of 'ps l'. You must not kill a child before
killing the parent. Killing the child may cause the parent to
wake up, and go back to its errant ways of replicating itself.
Most of the time when you see this, some of the processes have
process 1 as the parent ID. The procedure is to kill all of the
errant processes whose PPID is 1. Keep repeating this step till
they are all gone. Usually this becomes easier as you proceed,
for you stop getting the 'out of processes' message after killing
a few, and no longer need to 'exec /bin/kill' and relogin after every
try.
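The parent-before-child rule in Step 3 can be sketched as below. This is only an illustration under stated assumptions: the function names 'orphans' and 'reap_orphans' are hypothetical, and it assumes a ps that accepts the POSIX-style '-u user -o pid= -o ppid=' options rather than the BSD 'ps l' mentioned above. Because it uses a pipeline, it has to be run from an account that still has free process slots (root, or a friend's account).

```shell
# Read "PID PPID" pairs on stdin and print the PIDs whose parent is
# process 1. These are the ones that are safe to kill first.
orphans() {
    awk '$2 == 1 { print $1 }'
}

# One pass of Step 3 for the given user: kill every process of that
# user whose PPID is 1. Children of the killed processes are then
# inherited by process 1, so repeat until nothing is left.
# Assumes the processes were already frozen with kill -STOP (Step 2).
reap_orphans() {
    ps -u "$1" -o pid= -o ppid= | orphans | while read -r pid; do
        kill -KILL "$pid"
    done
}
```

A typical use would be to call 'reap_orphans student' repeatedly (where 'student' stands for the affected login name) until ps shows no more of the errant processes.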
(3) The brute force method.
I posted a script to do this recently. It was posted as article
<1990Oct26.140851.11707 at mp.cs.niu.edu>. Read that article for
full information. It requires that you be root to execute it,
and it requires that the perpetrator's login shell be 'csh'
(because 'kill' is then builtin and doesn't require a new process).
The basic idea is 'blocking'. You keep the number of processes at
the limit, so as to prevent further replication. The script does
the following:
    for each errant process
        create a new process (/bin/csh) for the user.
        kill the errant process
        the new process exec's to 'sleep 10 minutes' so
          as to be relatively harmless.
If the processes are dying as well as replicating, my script may
need to be rerun a few times. But, regardless, it soon creates
enough sleeps under the userid that further replication of all
errant processes is impossible, so they either all die out
naturally, or sit around long enough to be killed.
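To make the blocking idea concrete, here is a minimal sketch. It is not the script from the article cited above; the function name 'blocking_commands' is hypothetical, and the use of 'su user -c' to start the user's shell is an assumption. The function only prints the commands, one per errant PID, so they can be reviewed before root runs them.

```shell
# Print (do not run) one blocking command per errant PID.
# Each command starts the user's login shell via su, kills one errant
# process with the shell's builtin kill, then exec's a 600-second
# sleep so the freed process slot stays occupied.
# Assumption: the user's login shell has a builtin kill (e.g. csh).
blocking_commands() {
    user=$1
    shift
    for pid in "$@"; do
        printf 'su %s -c "kill -KILL %s; exec sleep 600" &\n' "$user" "$pid"
    done
}
```

For example, 'blocking_commands student 101 102' prints two su command lines. Printing rather than executing is a deliberate choice here, since the commands run as root against another user's processes.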
I have thought of rewriting the script as a C-program. It would be SUID,
so that anyone could use it. Basically it would allow a user to type
'exec superkill' to kill all of his processes. I have never bothered to do
this because the problem does not seem to crop up often enough to go to
the trouble.
--
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Neil W. Rickert, Computer Science <rickert at cs.niu.edu>
Northern Illinois Univ.
DeKalb, IL 60115. +1-815-753-6940