4.2 "soft ecc" errors
Chris Torek
chris at umcp-cs.UUCP
Tue Oct 7 11:04:27 AEST 1986
(Since I have seen no summary of replies, and since I can answer
most of these, I shall ignore the `reply by mail' request.)
In article <4072 at brl-smoke.ARPA> vader!root at LBL-CSAM.arpa (RADIX System) writes:
>... I get the following error message at about 10 minute intervals:
>
> mcr0: soft ecc addr xxx syn yy
>
>I also get the following when we boot:
>
> WARNING: should run interleaved swap with >= 2MB
>
>1) How do I "run interleaved"?
This refers to swap/paging partitions. If you have two or more
disc drives, you should set up swap areas on at least two. See
`Building Systems with Config'. Using multiple swap areas is supposed
to be faster. Whether it is in fact faster is a function of many
variables.
>2) Is the boot message an indication of why I am getting the other
>messages?
No.
>3) If I go back to 4.1, I don't see the "ecc" message (or the other
>one, for that matter). Is there really something wrong with my memory
>boards?
Yes. 4.1 had less support for 750s, and presumably did not catch
750 ECC errors.
>4) I have discovered that the "ecc" message is (likely) from
>/usr/sys/vax/machdep.c
It is indeed.
>and I have found several
> #if TRENDATA
> ...
> #endif
>lines. But when I defined TRENDATA as an "optional" in my kernel
>configuration file (and reboot), the same error messages continue
>to come out. Am I missing some "bugfix" code for TRENDATA memory
>on a 750? (Looks like most of the TRENDATA mods are for 780 machines.)
The Trendata tables are for specific boards, probably for 780s.
Whether they apply to yours is questionable. In any case, Trendata
should have provided you with, or be able to provide you with,
decoding tables. If Trendata understands only VMS format errors,
just concatenate `xxx' and `yy' and pad with zeroes on the left:
	mcr0: soft ecc addr 54f90 syn e3
means the same as VMS's
	?VMS-W-WARNINGMESSAGE, ridiculously long error string that
	lets you know something is wrong, but is no more help than
	`soft ecc addr ...' when it comes to figuring out just
	what, but fortunately you can look it up in some manual,
	which will of course just tell you to call Field Service,
	ERR ADDR=054F90E3
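The concatenate-and-pad step can be done mechanically. What follows is
only a sketch: ecc_to_vms() is a made-up helper name, not anything in the
kernel or in VMS, and it assumes the syndrome is always two hex digits (as
the `yy' above suggests) and that the result fits in eight hex digits.

```c
#include <stdio.h>

/*
 * Hypothetical helper: turn a 4.2BSD console line such as
 *	mcr0: soft ecc addr 54f90 syn e3
 * into the eight-digit value VMS would print after "ERR ADDR=".
 * Assumes a two-digit syndrome, so "concatenate" is the same as
 * shifting the address left by 8 bits and OR-ing in the syndrome.
 * Returns 0 on success, -1 if the line does not parse.
 */
int
ecc_to_vms(const char *line, char *buf, size_t len)
{
	unsigned long addr;
	unsigned int syn;

	/* %*d skips the controller number after "mcr". */
	if (sscanf(line, "mcr%*d: soft ecc addr %lx syn %x", &addr, &syn) != 2)
		return -1;

	/* Address digits, then syndrome digits, zero-padded on the left. */
	snprintf(buf, len, "%08lX",
	    ((addr << 8) | (syn & 0xff)) & 0xffffffffUL);
	return 0;
}
```

Handing it the example line above, with a buffer of at least nine bytes,
leaves 054F90E3 in the buffer.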
>5) Besides risking the filling of my disk from /usr/adm/messages, is
>there any other danger in ignoring the error messages?
Yes. If another few chips fail, you will no longer get soft
(correctable) errors; you will get crashes.
Incidentally, just because you see the messages only once every
ten minutes does not mean the ECC correction is infrequent. The
code in /sys/vax/machdep.c disables ECC reporting after each
error, then re-enables it ten minutes later. This is controlled
by the variable `memintvl', which is in seconds:
	% su
	Password:
	# adb -w /vmunix /dev/kmem
	memintvl/W 1
	_memintvl:
	_memintvl: 258 = 1
	$q
	#
will re-enable reporting after one second. Stand back from the
console, and have plenty of paper handy!
Rebooting will restore the ten minute interval; or you can use adb
again to change it back.
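(The `258' adb prints is hexadecimal: 0x258 = 600 seconds = ten minutes.)

To see why one message per interval says little about the real error
rate, here is a small user-space model of the disable/re-enable scheme.
It is emphatically not the machdep.c code: the structure, the function
names and the explicit `now' argument are all invented for illustration,
and only the name memintvl and its meaning (seconds) come from the kernel.

```c
/*
 * Toy model of ECC report throttling (NOT the 4.2BSD implementation).
 * After each reported error, reporting is switched off and comes back
 * on memintvl seconds later.  Errors that arrive in between are still
 * corrected by the hardware; they are simply never printed.
 */
struct ecc_reporter {
	long memintvl;		/* seconds until reporting is re-enabled */
	long reenable_at;	/* time at which reporting comes back on */
	int enabled;		/* nonzero while reporting is on */
	unsigned long seen;	/* corrected errors that occurred */
	unsigned long reported;	/* errors that produced a message */
};

void
ecc_init(struct ecc_reporter *r, long memintvl)
{
	r->memintvl = memintvl;
	r->reenable_at = 0;
	r->enabled = 1;
	r->seen = 0;
	r->reported = 0;
}

/* Record one soft error at time `now' (seconds); return 1 if reported. */
int
ecc_error(struct ecc_reporter *r, long now)
{
	r->seen++;
	if (!r->enabled && now >= r->reenable_at)
		r->enabled = 1;
	if (!r->enabled)
		return 0;
	r->enabled = 0;
	r->reenable_at = now + r->memintvl;
	r->reported++;
	return 1;
}
```

Feed this one error per second for half an hour with memintvl at 600 and
it reports 3 of the 1800; set memintvl to 1 and it reports every one of
them, which is the reason for the advice about paper.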
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP: seismo!umcp-cs!chris
CSNet: chris at umcp-cs ARPA: chris at mimsy.umd.edu