Help on deciphering crash

Mon Jan 5 02:38:24 AEST 1987

>In article <3645 at sdcrdcf.UUCP> davem at sdcrdcf.UUCP (David Melman) writes:
>>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>>machine check 2: cp tbuf par fault
>>	va 80039728 errpc 8000394e mdr a smr 8 rdtimo 0 tbgpar 0 cacherr 5
>>	buserr 6 mcesr 9 pc 8000394e psl 40c0008 mcsr 80016

>In article <4891 at mimsy.UUCP>, chris at mimsy.UUCP (Chris Torek) writes:
>>There are two interrelated fixes for this.  Both are already in
>>4.3BSD.  The first is that some tbuf parity errors can be corrected [...]

In article <1419 at cit-vax.Caltech.Edu> mangler at cit-vax.Caltech.Edu
(System Mangler) writes:
>Read the registers.  This is a cache parity error, not a tbuf parity
>error.  Never mind that 4.[23] doesn't distinguish between the two.

Sure enough.  I never bothered to read the bits, knowing that `this
occurs all the time and is always a tbuf error'.

>We get these all the time.  There are two ways to "fix" it:  swap
>L0003 boards until you get a good one ($$$), or change the machine
>check handler to flush the cache and return.  Now, can anyone tell
>me how to flush the cache?

Maybe the microcode fix helps this too?  I have never seen a cache
error here (but tb errors were extremely rare too: probably a
consequence of our ordering our 750s with Ultrix 1.0 way back when.)

Anyway, you could try disabling the cache:

	mtpr(CADR, 1);	/* CADR is register 0x25 */

but that will probably slow the machine to a crawl.  Disabling
and reenabling the cache might well flush it, though.  If

	mtpr(CADR, 1);
	mtpr(CADR, 0);

does not clear the problem, perhaps reenabling it after a long
delay will:

	mtpr(CADR, 1);
	timeout(cacheenable, (caddr_t) 0, 10*hz);
	...

/* called via timeout() about ten seconds later: turn the cache back on */
cacheenable()
{

	mtpr(CADR, 0);
}
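
For anyone who wants to see the off-then-on sequence in isolation,
here is a rough user-mode sketch.  The mtpr() below is only a stand-in
that records the write (the real one is a privileged instruction), and
cachebounce, cadr_shadow and cadr_writes are names invented for the
example.  Whether the sequence actually empties the 750's cache remains
a guess.

```c
#include <stdio.h>

#define CADR	0x25	/* cache disable register, number as given above */

/*
 * Stand-in for the kernel's mtpr() so the sequence can be tried in
 * user mode; it only records what would have been written.
 * cadr_shadow and cadr_writes are made up for this sketch.
 */
static int cadr_shadow;
static int cadr_writes;

static void
mtpr(int reg, int val)
{
	if (reg == CADR) {
		cadr_shadow = val;
		cadr_writes++;
	}
	printf("mtpr(0x%x, %d)\n", reg, val);
}

/*
 * Hypothetical helper a machine check handler might call before
 * returning: cache off, then straight back on.
 */
static void
cachebounce(void)
{
	mtpr(CADR, 1);	/* 1 = cache disabled */
	mtpr(CADR, 0);	/* 0 = cache enabled again */
}
```

In a real handler the call to cachebounce() would sit just before the
return, so the cache ends up enabled again in every case.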

But according to the registers I can read above (DEC's latest VAX
Hardware Handbook does NOT include machine check frames---why?),
returning may not help much in this case, because the machine
check error summary register (mcesr) reads 9, which has the 0x8
bit set: bus error.  Returning to the failed instruction may well
not retry the failed read.  Since the fault occurred in kernel
mode, that might bring the machine down anyway.
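
To make the bit-reading concrete, here is a small sketch that picks
apart the mcesr value 9 from the crash report.  The mask names are
invented for the example.  The 0x8 mask is the bus error bit discussed
above; the 0x4 mask for tbuf parity is my assumption about the 750
layout and should be checked against the hardware documentation.

```c
#include <stdio.h>

/*
 * Assumed 11/750 mcesr masks.  The names are made up for this sketch;
 * MCESR_TBPAR in particular is an assumption, not taken from the post.
 */
#define MCESR_TBPAR	0x4	/* translation buffer parity (assumed) */
#define MCESR_BUSERR	0x8	/* bus error, as read from the report */

/* Return 1 if the bus error bit is set in the given mcesr value. */
static int
mcesr_has_buserr(unsigned mcesr)
{
	return (mcesr & MCESR_BUSERR) != 0;
}

/* Return 1 if the (assumed) tbuf parity bit is set. */
static int
mcesr_has_tbpar(unsigned mcesr)
{
	return (mcesr & MCESR_TBPAR) != 0;
}

/* Print a one-line summary of the two bits this sketch knows about. */
static void
mcesr_print(unsigned mcesr)
{
	printf("mcesr %x: buserr %d tbpar %d\n", mcesr,
	    mcesr_has_buserr(mcesr), mcesr_has_tbpar(mcesr));
}
```

Calling mcesr_print(0x9) reports the bus error bit set and the assumed
tbuf parity bit clear, which fits the reading that this was not a
tbuf parity error.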
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris at mimsy.umd.edu