V7 "proc on q" diagnostic message
Dave Edwards
edwards at felix.UUCP
Sat Dec 1 07:13:10 AEST 1984
I've always been amazed at how a bug can exist for years with nobody
noticing it, then within days of each other, several people do.
Peter Gross has noted some "proc on q" printfs from his V7 system. He
does not mention what incarnation of V7 or what type of system he has,
but I have seen something similar on our V7 derived from the MIT 68000
port. Incidentally, our news feed has apparently been a bit flakey
lately, so if this has already been covered, could somebody please
mail me the resolution?
> In V7 Unix (and perhaps others) there is a diagnostic printf "proc on q"
> in the routine setrq() which puts its argument on the run queue. The
> message is the result of finding the given process already on the run
> queue. We were getting this message occasionally in early morning hours.
We were also getting this message while running a complex application
under development.
> The panic comes when one process sends another an EMT
> signal to indicate that an I/O operation has completed. The kernel traps
> the kill() system call, calls psignal(), which in turn calls setrq().
> setrq() finds that the proc is already on the runq (yet its p_stat was
> SSLEEP!). Sounds like a race condition to me.
This is almost exactly the same condition that caused it for us. In
looking at things, I discovered that MIT had significantly changed the
way setrun() and wakeup() interact. To be precise, in our version,
psignal() calls setrun(), which calls setrq().
Now, the way psignal() is written, there is indeed a race condition:
an interrupt can make the process runnable between the check that the
process is in the SSLEEP state and the spl6() in setrq(). My copy of
the standard PDP-11 V7 code avoids this condition in a bizarre way
that relies on wakeup() being idempotent. That way of doing things is
admittedly inefficient and could also cause other processes to wake up
when they shouldn't, so I don't blame the MIT people for changing it.
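
To make the window concrete, here is a minimal user-space sketch of the check-then-act race. It is not kernel source: the state values, the dup_count counter, psignal_racy(), and wakeup_interrupt() are hypothetical stand-ins, and the "interrupt" is simulated by a callback invoked between the SSLEEP check and the call to setrun().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { SSLEEP = 1, SRUN = 3 };      /* illustrative state values */

struct proc {
    int          p_stat;            /* SSLEEP or SRUN */
    struct proc *p_link;            /* next entry on the run queue */
};

struct proc *runq;                  /* head of the run queue */
int dup_count;                      /* times "proc on q" was reported */

/* Put p on the run queue; complain if it is already there. */
void setrq(struct proc *p)
{
    struct proc *q;

    assert(p != NULL);
    for (q = runq; q != NULL; q = q->p_link) {
        if (q == p) {
            printf("proc on q\n");
            dup_count++;
            return;
        }
    }
    p->p_link = runq;
    runq = p;
}

/* Mark p runnable and queue it, with no check for a lost race. */
void setrun(struct proc *p)
{
    p->p_stat = SRUN;
    setrq(p);
}

/* Hypothetical stand-in for an interrupt handler that wakes p. */
void wakeup_interrupt(struct proc *p)
{
    if (p->p_stat == SSLEEP)
        setrun(p);
}

/*
 * Signal delivery with the window open: intr, if given, runs after
 * the SSLEEP check but before setrun(), as an interrupt might.
 */
void psignal_racy(struct proc *p, void (*intr)(struct proc *))
{
    if (p->p_stat == SSLEEP) {      /* check */
        if (intr != NULL)
            intr(p);                /* interrupt lands in the window */
        setrun(p);                  /* act on stale information */
    }
}
```

Calling psignal_racy() with a NULL callback queues the process once; passing wakeup_interrupt makes the second setrq() find the process already queued, which is the situation the diagnostic reports.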
However, their change re-opens the race condition. My fix was to
change setrun(), since I don't know what other nasty things might
be done in other procedures which call it. It involves putting an
spl6() around slightly more code and checking for the presence of
the race. It causes the priority to be high for slightly longer,
but I believe it is better than the original V7 code.
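
The shape of such a change can be sketched in user space as follows. This is only an illustration under stated assumptions, not the actual patch: setrun_checked(), psignal_sketch(), wakeup_interrupt(), and the ipl variable are hypothetical, and spl6()/splx() here merely record a simulated priority level instead of masking real interrupts.

```c
#include <assert.h>
#include <stddef.h>

enum { SSLEEP = 1, SRUN = 3 };      /* illustrative state values */

struct proc {
    int          p_stat;            /* SSLEEP or SRUN */
    struct proc *p_link;            /* next entry on the run queue */
};

struct proc *runq;                  /* head of the run queue */
int dup_count;                      /* duplicate insertions seen by setrq */
int ipl;                            /* simulated processor priority */

/* Simulated spl6()/splx(); on real hardware these mask interrupts. */
int spl6(void) { int old = ipl; ipl = 6; return old; }
void splx(int old) { ipl = old; }

/* Put p on the run queue, counting any attempt to queue it twice. */
void setrq(struct proc *p)
{
    struct proc *q;

    assert(p != NULL);
    for (q = runq; q != NULL; q = q->p_link) {
        if (q == p) {
            dup_count++;
            return;
        }
    }
    p->p_link = runq;
    runq = p;
}

/*
 * Raise priority first, then re-check the state. If the process is
 * already SRUN, someone else won the race and there is nothing to do.
 */
void setrun_checked(struct proc *p)
{
    int s = spl6();

    if (p->p_stat != SRUN) {
        p->p_stat = SRUN;
        setrq(p);
    }
    splx(s);
}

/* Hypothetical interrupt handler; it cannot run at raised priority. */
void wakeup_interrupt(struct proc *p)
{
    assert(ipl < 6);
    if (p->p_stat == SSLEEP)
        setrun_checked(p);
}

/* Caller with the same window as before; intr simulates the interrupt. */
void psignal_sketch(struct proc *p, void (*intr)(struct proc *))
{
    if (p->p_stat == SSLEEP) {
        if (intr != NULL)
            intr(p);
        setrun_checked(p);
    }
}
```

Because the state is tested again after the priority is raised, the second call finds the process already SRUN and returns without touching the run queue, so callers other than psignal() are covered as well.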
I have installed this fix just this week, and we haven't seen the
problem again, although it is pretty rare and heavily dependent
on timing conditions, so I can't vouch for the perfection of my
fix. But if anyone wants the code, let me know.
Dave Edwards
FileNet Corp.
Costa Mesa, Calif.
{ucbvax,decvax,ihnp4,sdcrdcf}!trwrb!felix!edwards