Help on deciphering crash
Chris Torek
chris at mimsy.UUCP
Mon Jan 5 02:38:24 AEST 1987
>In article <3645 at sdcrdcf.UUCP> davem at sdcrdcf.UUCP (David Melman) writes:
>>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>>machine check 2: cp tbuf par fault
>> va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>> buserr 6 mcesr 9 pc 8000394e psl 40c0008 mcsr 80016
>In article <4891 at mimsy.UUCP>, chris at mimsy.UUCP (Chris Torek) writes:
>>There are two interrelated fixes for this. Both are already in
>>4.3BSD. The first is that some tbuf parity errors can be corrected [...]
In article <1419 at cit-vax.Caltech.Edu> mangler at cit-vax.Caltech.Edu
(System Mangler) writes:
>Read the registers. This is a cache parity error, not a tbuf parity
>error. Never mind that 4.[23] doesn't distinguish between the two.
Sure enough. I never bothered to read the bits, knowing that `this
occurs all the time and is always a tbuf error'.
>We get these all the time. There are two ways to "fix" it: swap
>L0003 boards until you get a good one ($$$), or change the machine
>check handler to flush the cache and return. Now, can anyone tell
>me how to flush the cache?
Maybe the microcode fix helps this too? I have never seen a cache
error here (but tb errors were extremely rare too: probably a
consequence of our ordering our 750s with Ultrix 1.0 way back when.)
Anyway, you could try disabling the cache:
    mtpr(CADR, 1);      /* CADR is register 0x25 */
but that will probably slow the machine to a crawl. Disabling
and reenabling the cache might well flush it, though. If
    mtpr(CADR, 1);
    mtpr(CADR, 0);
does not clear the problem, perhaps reenabling it after a long
delay will:
    mtpr(CADR, 1);      /* disable the cache now */
    timeout(cacheenable, (caddr_t) 0, 10*hz);   /* reenable in ~10 seconds */
    ...

    cacheenable()
    {
        mtpr(CADR, 0);  /* reenable the cache */
    }
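For anyone who wants to experiment with the disable/reenable sequence
outside the kernel first, here is a minimal user-space sketch. The
mtpr() below is only a stand-in that records writes to a fake register
array (the real one executes the VAX MTPR instruction), and the names
fake_pr, NPR, cadr_writes and cache_toggle are made up for
illustration. Whether toggling CADR really flushes the cache on a 750
remains unverified.

```c
#include <assert.h>

#define CADR 0x25   /* cache disable register number, as given in the post */
#define NPR  0x40   /* hypothetical size of the stand-in register file */

static int fake_pr[NPR];    /* stand-in for the privileged registers */
static int cadr_writes;     /* number of writes made to CADR so far */

/* Stand-in for the kernel's mtpr(); it only records the value written. */
static void mtpr(int reg, int value)
{
    assert(reg >= 0 && reg < NPR);
    fake_pr[reg] = value;
    if (reg == CADR)
        cadr_writes++;
}

/* Disable the cache, then immediately reenable it. */
static void cache_toggle(void)
{
    mtpr(CADR, 1);  /* 1 = cache off */
    mtpr(CADR, 0);  /* 0 = cache back on */
}
```

In a real machine check handler, a call like cache_toggle() would go
just before the return, but only for the cache parity case.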
But according to the registers I can read above (DEC's latest VAX
Hardware Handbook does NOT include machine check frames---why?),
returning may not help much in this case: the machine check error
summary register (mcesr) reads 9, so its 0x8 bit, bus error, is set.
Returning to the failed instruction may well not retry the failed
read. Since the fault occurred in kernel mode, that might bring the
machine down anyway.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP: seismo!mimsy!chris ARPA/CSNet: chris at mimsy.umd.edu