Help on deciphering crash
Chris Torek
chris at mimsy.UUCP
Wed Dec 31 12:29:37 AEST 1986
In article <3645 at sdcrdcf.UUCP> davem at sdcrdcf.UUCP (David Melman) writes:
>Our Vax 750 running 4.2BSD has occasionally been crashing with:
>machine check 2: cp tbuf par fault
[lots of registers]
>panic: mchk
>panic: sleep
There are two interrelated fixes for this. Both are already in
4.3BSD. The first is that some tbuf parity errors can be corrected
by flushing the translation buffer. As I recall, 4.2 has code to
do this, but it uses the wrong test to decide whether a flush will
suffice: somewhere it masks with 0xf where it should mask with 0xe.
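
The effect of that one-bit difference in the mask can be seen in a
small sketch. This is only an illustration: the function name, the
constant name, and the value 0x4 are assumptions made up for the
example, not the 4.2BSD or 4.3BSD source. The point is just that
0xe ignores the low bit of the status while 0xf does not.

```c
/* Hypothetical illustration only: the names and the value 0x4 are
 * assumptions, not copied from the 4.2BSD or 4.3BSD kernel source. */
#define TBUF_PARITY_CODE 0x4u   /* assumed "tbuf parity" code in the status */

/* Returns nonzero if, under the given mask, the status matches the
 * assumed parity code, i.e. the error is treated as one that a
 * translation buffer flush can clear. */
static int flush_will_suffice(unsigned status, unsigned mask)
{
    return (status & mask) == TBUF_PARITY_CODE;
}
```

With a status of 0x5 (the assumed parity code plus the low bit set),
the 0xe mask treats the error as recoverable, while the 0xf mask
rejects it and the kernel would go on to panic.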
The second is a `jelloware' (writable control store) fix for a
timing problem in one CPU module. The 4.3 boot program knows to
load the file `pcs750.bin' into the 750 patch store. The code to
do this is not terribly large, and is all contained in /sys/stand/boot.c
at your nearest 4.3 site, which also has /pcs750.bin.
Incidentally, the `panic: sleep' is due to a bug in sleep that
affects things only after a previous panic. I fixed this in our
4.2 kernels back when Jim O'Toole and I were writing a kernel XNS.
I was rather amused to find the very same fix in the 4.3-alpha
kernel. It helps considerably when you crash your machine several
times a day!
Also incidentally, the 4.3 boot program has no way to avoid loading
the /pcs750.bin file, something I consider a bug (now that I have
been bitten by it). We recently had a 750 go down for two weeks.
The long downtime was caused by three virtually simultaneous
failures. First, one of two CDC9771 HDAs died suddenly. Second,
our standby disk system (two RK07s) had some sort of controller
backplane problem (considering how often we use the RK07s, it may
have developed long ago). Third, and only discovered last Friday,
our WCS board went out at the same time as the HDA. As long as I
did not load the microcode update, the machine would boot. With
the microcode in place, the machine would hang completely: not even
control-P did anything. While this hardware failure might be quite
rare, it forced me to consider what would happen if part of
/pcs750.bin were overwritten. I added another boot flag to
prevent the microcode update.
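
A boot flag of this kind might be checked roughly as follows. This is
a sketch under stated assumptions, not the actual change: the flag
name RB_NOPCS_EXAMPLE, its bit value, and the helper function are all
hypothetical, and a real flag would have to be chosen so as not to
collide with the existing boot flags.

```c
/* Hypothetical sketch: the flag name, its value, and this helper are
 * invented for illustration; they are not from the 4.3BSD boot program. */
#define RB_NOPCS_EXAMPLE 0x100  /* assumed "skip microcode patch" boot flag */

/* Decide whether the boot program should load /pcs750.bin.
 * howto      - boot flags word passed to the boot program
 * cpu_is_750 - nonzero when running on a 750 */
static int should_load_pcs(int howto, int cpu_is_750)
{
    if (!cpu_is_750)
        return 0;               /* the patch only applies to the 750 */
    if (howto & RB_NOPCS_EXAMPLE)
        return 0;               /* operator asked to skip the update */
    return 1;
}
```

With such a flag set, a machine with a dead WCS board or a damaged
/pcs750.bin can still be booted without the patch.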
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP: seismo!mimsy!chris ARPA/CSNet: chris at mimsy.umd.edu