Reliability of (Sys V) file systems on power failure

Thu Sep 27 05:24:46 AEST 1990

I had opined that you shouldn't see file system damage on a power hit, and
also noted that I hadn't seen damage (beyond files being written during the
hit) for quite a few years.

karl at naitc.naitc.com (Karl Denninger) writes:

> Ok, I've seen filesystem damage of this type, on your Operating System
> (2.0.2), and another employee here has seen the same thing on his copy of
> ISC 2.2.
> 
> To put it bluntly, there's something wrong that should be fixed.

This sort of thing is tough to work out without a lot of detail, but since
Karl has said, "the gauntlet has been thrown down," let's see if we can make
some progress on it here.  I'm game--I don't want to find out "the hard
way" that there are ways to take major damage from a power failure, so if
Karl has seen it, I'd like to learn from it.

> OK, so why did my /etc/default/boot file get whacked a few months back when
> we had a power failure?
...
> (For the unknowing, lacking an /etc/default/boot file, which is READ ONLY,
> you can't boot the machine!)

"Whacked" is a little too technical for me just yet.  Do you mean that it
ended up empty, or missing entirely?  After you recreated it and got the
system back up, did anything like the boot file show up in lost+found?
Was the rest of /etc/default OK, or did it take out the whole directory?

Here's what I'm trying to get at:  If the file was corrupted or gone,
something got written that shouldn't have been written.  The first task is
to find out what got written.  The reasoning goes like this:
	- If /etc/default (the directory containing boot) got corrupted,
	  I'd want to know what ended up there, because that directory
	  shouldn't be subject to change during "normal" system operation.
	- If the inode for boot got corrupted, you'd expect a chunk of
	  inodes (one disk sector) to get it...and it's likely that other
	  files would be hit also.  The boot parameter file is likely to
	  share its inode sector with other files that are "important" but
	  seldom modified.  An access-time update could have been in
	  progress when the power failed.  If it toasted a full sector,
	  you'd expect to see other important files damaged or gone.
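
To make the shared-sector argument concrete, here is a minimal sketch of
which inodes would go down together.  It assumes the classic System V
on-disk layout (64-byte inodes, 512-byte sectors, inode numbers starting
at 1, and an inode list that begins on a sector boundary); none of those
numbers come from Karl's report, and the function name is made up for
illustration.

```python
# Assumed layout of a classic System V file system (not taken from the post):
INODE_SIZE = 64       # bytes per on-disk inode
SECTOR_SIZE = 512     # bytes per disk sector

INODES_PER_SECTOR = SECTOR_SIZE // INODE_SIZE   # 8 under these assumptions


def inodes_sharing_sector(inum):
    """Return the inode numbers stored in the same sector as inode `inum`.

    Inode numbers start at 1, and the inode list is assumed to start on a
    sector boundary, so inodes 1..8 share the first sector, 9..16 the next.
    """
    if inum < 1:
        raise ValueError("inode numbers start at 1")
    # Index of the sector (within the inode list) that holds this inode.
    sector_index = (inum - 1) // INODES_PER_SECTOR
    first = sector_index * INODES_PER_SECTOR + 1
    return list(range(first, first + INODES_PER_SECTOR))


if __name__ == "__main__":
    # Hypothetical inode number, e.g. as reported by `ls -i`.
    print(inodes_sharing_sector(42))
```

Given the inode number of the boot file (from `ls -i`), the list says which
other inode numbers to check.  If those files came through intact, a trashed
inode sector becomes a much less likely explanation.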

> Host adapter was an Adaptec 1542B, disk a Maxtor (which has power-safe logic
> that disables the write gate when power goes out of safe margins).

Sounds good so far.  What's the box?  If you've built it up from parts,
then what's the motherboard?  As you can guess, I don't yet see cause to
call either hardware or software guilty or innocent.

Again, if something got corrupted, it means that something got written
that shouldn't have been written.  The problem--and it's NOT likely to be
an easy one--is to find out what was written wrong.  That's likely to give
a clue whether it's hardware or software (or a conspiracy of the two:-).

> Ok Mr. Dunn, the gauntlet has been thrown down.  If you want details of the
> failures we have had with YOUR OS (btw, SunOS4.1 doesn't seem to take these
> hits) you're welcome to call me here.  

I don't follow the connection to SunOS4.1--correct me if I'm wrong, but I
didn't think there was a hardware platform common to ISC's Sys V.3.2 and
SunOS.  (386i???)

(My OS???  Let's clarify:  I do use ISC systems, both at work and at home.
I'm taking an interest in this because I want to know how and why the
failures you've seen can happen--it's an important question.  But I'm not
speaking for ISC on the net.)
-- 
Dick Dunn     rcd at ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Worst-case analysis must never begin with "No one would ever want..."