Reliability of (Sys V) file systems on power failure

Sat Sep 29 03:08:50 AEST 1990

In article <1990Sep26.192446.22110 at ico.isc.com> rcd at ico.isc.com (Dick Dunn) writes:
>I had opined that you shouldn't see file system damage on a power hit, and
>also noted that I hadn't seen damage (beyond files being written during the
>hit) for quite a few years.
....
>> OK, so why did my /etc/default/boot file get whacked a few months back when
>> we had a power failure?
>...
>> (For the unknowing, lacking an /etc/default/boot file, which is READ ONLY,
>> you can't boot the machine!)
>
>"Whacked" is a little too technical for me just yet.  Do you mean that it
>ended up empty, or missing entirely?  After you recreated it and got the
>system back up, did anything like the boot file show up in lost+found?
>Was the rest of /etc/default OK, or did it take out the whole directory?

Whacked means that fsck unlinked it entirely; the file was gone.  It did not
show up in lost+found; it was "cleared".  

Had I not figured out what happened and recreated the file from a boot
floppy, I would have had to reload the entire OS.  (For the uninitiated, you
never want to be in this position: you don't get to see an error message
about the missing file; the machine just fails to come up, with no apparent
cause.)  As it was, I lost a couple of hours figuring out why my machine
wouldn't come up and fixing it (most of that time went to the figuring-out
part).

>Here's what I'm trying to get at:  If the file was corrupted or gone,
>something got written that shouldn't have been written.  The first task is
>to find out what got written.  The sort of reasoning goes like this:
>	- If /etc/default (the directory containing boot) got corrupted,
>	  I'd want to know what ended up there, because that directory
>	  shouldn't be subject to change during "normal" system operation.
>	- If the inode for boot got corrupted, you'd expect a chunk of
>	  inodes (one disk sector) to get it...and it's likely that other
>	  files would be hit also.  The boot parameter file is likely to
>	  share its inode sector with other files that are "important" but
>	  seldom modified.  An access-time update could have been in
>	  progress when the power failed.  If it toasted a full sector,
>	  you'd expect to see other important files damaged or gone.
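
To make the inode-sector argument above concrete, here is a minimal sketch of
which inodes would be hit together if one sector of the inode list were
toasted.  It assumes the classic System V on-disk inode size of 64 bytes,
512-byte sectors, and inode numbers that start at 1 and are packed
contiguously; those figures and the name sector_mates are assumptions for
illustration, and the ISC filesystem layout may differ.

```python
# Assumed layout values, not taken from the ISC sources.
INODE_SIZE = 64       # bytes per on-disk inode (classic System V assumption)
SECTOR_SIZE = 512     # bytes per disk sector (assumption)
INODES_PER_SECTOR = SECTOR_SIZE // INODE_SIZE


def sector_mates(inum):
    """Return the inode numbers that share a disk sector with inode `inum`.

    Inode numbers are assumed to start at 1 and to be packed contiguously,
    so each sector holds one aligned group of INODES_PER_SECTOR inodes.
    """
    if inum < 1:
        raise ValueError("inode numbers start at 1")
    # First inode number of the aligned group that contains `inum`.
    first = ((inum - 1) // INODES_PER_SECTOR) * INODES_PER_SECTOR + 1
    return list(range(first, first + INODES_PER_SECTOR))
```

Under these assumptions sector_mates(42) returns inodes 41 through 48.  On a
live system, "ls -i /etc/default/boot" prints the file's inode number; if the
whole sector had been lost, the files holding the neighbouring inode numbers
should show damage as well.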

The boot file itself was good, as the system did give the "Booting" message,
and once the default file was put back, all was well.

Also note that the physical format on the disk was just fine; normally, if
power is interrupted >during< a write and you get damage, the sector will
then have a "hard" error on it.  This was not the case.

>> Host adapter was an Adaptec 1542B, disk a Maxtor (which has power-safe logic
>> that disables the write gate when power goes out of safe margins).
>
>Sounds good so far.  What's the box?  If you've built it up from parts,
>then what's the motherboard?  As you can guess, I don't yet see cause to
>say that either hardware or software is either guilty or innocent.

Compaq 386, and we've seen the same kind of problem with an AT&T 6386 with
the same (and different) disk/adapter combinations.  A colleague of mine who
sits across the hall has had many files killed or corrupted (like X11 files
which are normally read-only, /etc/netd.cf, minor things like that) from
power failures.  We finally gave up and put UPSs on both machines; that has
stopped the insanity.

I've seen this same failure mode with MFM, RLL, ESDI and SCSI disks across
lots of different platforms -- but only with ISC OSs.

>Again, if something got corrupted, it means that something got written
>that shouldn't have been written.  The problem--and it's NOT likely to be
>an easy one--is to find out what was written wrong.  That's likely to give
>a clue whether it's hardware or software (or a conspiracy of the two:-).
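
One way to approach "find out what was written wrong" is to record checksums
of the files that should never change and compare them after a power hit.
The following is only a sketch under that assumption: the function names
snapshot and diff_snapshots are hypothetical, and it detects which files
changed or vanished, not why.

```python
import hashlib


def snapshot(paths):
    """Map each path to the SHA-256 hex digest of its contents.

    A path that is missing or unreadable maps to None.
    """
    result = {}
    for path in paths:
        try:
            with open(path, "rb") as f:
                result[path] = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            result[path] = None
    return result


def diff_snapshots(before, after):
    """Return {path: status} for every path in `before` whose state differs.

    Status is "missing" if the path is now absent or unreadable,
    otherwise "changed".
    """
    changes = {}
    for path, old in before.items():
        new = after.get(path)
        if new == old:
            continue
        changes[path] = "missing" if new is None else "changed"
    return changes
```

Taking snapshot(["/etc/default/boot", "/etc/netd.cf"]) while the system is
healthy, storing the result off the disk in question, and running
diff_snapshots against a fresh snapshot after the next power failure would at
least narrow down which supposedly read-only files were touched.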

Yep.  I've seen this with every ISC release since the dawn of time, and have
NEVER seen this kind of problem on identical hardware with SCO Xenix (I don't
know about SCO Unix; I haven't run it for any length of time).  I have seen
examples on 1.0.6, 2.0, 2.0.1, 2.0.2, and now 2.2.  There's something
stinky in that "bitmapped filesystem monster" that is used in the ISC
system.  Yes, it does speed up the filesystem.  It's also dangerous without
power protection.  It doesn't bite you all the time, but it does get you 
often enough to make a UPS a mandatory part of all ISC systems unless you
like testing the integrity of your backup media under fire.

>> Ok Mr. Dunn, the gauntlet has been thrown down.  If you want details of the
>> failures we have had with YOUR OS (btw, SunOS4.1 doesn't seem to take these
>> hits) you're welcome to call me here.  
>
>I don't follow the connection to SunOS4.1--correct me if I'm wrong, but I
>didn't think there was a hardware platform common to ISC's Sys V.3.2 and
>SunOS.  (386i???)

We have lots of Sun machines here at the same location; they take the same
power plunge that the rest of the gear does.  Never has one of these
machines been burned when there is a problem with power.  Files which are
open for writing often do get lunched, yes, but that's a risk with ANY
filesystem.  Read-only parameter files, kernels, etc. have never been
damaged on the Suns.

They have, many times, on ISC.  American Power Conversion loves us; we're
using lots of their UPS systems to back up the Compaqs.

>(My OS???  Let's clarify:  I do use ISC systems, both at work and at home.
>I'm taking an interest in this because I want to know how and why the
>failures you've seen can happen--it's an important question.  But I'm not
>speaking for ISC on the net.)

Ok... that was a misunderstanding on my part.  I have, however, heard you
champion the company's products more than once here.

To make it short and sweet -- if you run ISC, make sure you have a UPS or
risk the loss of your filesystems.

--
Karl Denninger	AC Nielsen
kdenning at ksun.naitc.com
(708) 317-3285
Disclaimer:  Contents represent opinions of the author; I do not speak for
	     AC Nielsen on Usenet.