Weird problems caused by corrupted system and how I rebuilt it. (long)
Augustine Cano
canoaf at ntvax.UUCP
Fri Jun 9 12:09:02 AEST 1989
Hi everyone!
The problem: Programs that used to work stopped working. At first I
attributed the problem to the latest (as far as I know) version of C-kermit,
that I got from Columbia U. hoping that the major problems would be solved.
Well, no luck. Kermit (on a 3b1) still does not exit without help from ^C,
and, most disturbing, when the following was executed from a "take" file, it
locked up so badly that the only way out was to kill its parent shell from
another window. The work-around that used to work in the previous version
did not work anymore.
set modem att
! phtoggle
set line /dev/ph0
set speed 1200
dial nnn-nnnn
connect
Well now, after rebuilding the system, it does not lock up anymore; it just
does not set the modem properly.
The funny thing is that the exact same sequence works fine when typed at
the prompt. Am I overlooking something? Is anybody having the same problem?
Is anybody using kermit on a UNIX PC? I believe that this is a genuine kermit
problem. Am I wrong?
Vtem (anyone out there using vtem, the VT100 emulator that uses the
pty's?) also acted very strangely. I compiled it under install, and when I
ran it while logged in as install, it worked fine. Under another login it would
lock up; not even the echo appeared on the screen. The only solution was
to kill its parent shell, just like kermit. Vtem also sometimes mapped the
British character set since '#' appeared as the pound-sterling symbol.
This was solved by logging out.
Finally, one day, Lenny's sysinfo program stopped working after a brief power
outage. At his suggestion, I checked whether the lipc driver was loaded and,
sure enough, it wasn't.
It turned out that quite a few files were in inconsistent states. Many
libraries were different from the distribution ones, notably libc.a. This
is probably the result of trying to have shcc and ccc installed at the same
time. At least a couple of files related to loadable drivers were also
different and many ua files were inconsistent (2 entries for the same
package in installed software, pty entry remaining after it was removed, etc.)
The sysinfo problem was caused by the fact that loadavgd, when trying
to start, would dump core in /etc/lddrv (someone mentioned this symptom some
time ago) and therefore sysinfo could not communicate with it.
Rather than fixing up individual files and risking missing some, I decided that
it was time for a major overhaul. Not only will I end up with a guaranteed
consistent system, I thought, but the HD fragmentation would go way down.
The fragmentation did go down (from about 16.00 % to under 2 %) but I don't
want to have to do this again, EVER! (unless of course someone comes up with
an automated script to do it.)
At first, I thought: no big deal! Just back up the whole thing, boot from
floppy and restore everything. After 10 seconds though, it became obvious
that you either restore everything unconditionally, putting back the corrupted
files where they were, or you have to reconfigure the system manually when
you're done. This would mean re-creating the groups, users, links and
configuration from scratch, as well as finding out about each and every file
that didn't come in a distribution set.
The solution:
1 - Remove all installed packages from install (it is here that I found out
about the inconsistent ua files.)
2 - Back up /u (all users): find /u -print | cpio -oBcv > /dev/rfp021.
3 - MAKE SURE THAT THE CPIO SET IS READABLE: cpio -ictB < /dev/rfp021 for
this and all future cpio sets. This might save you a lot of grief.
When I first tried to restore the whole HD, (a cpio set of 90+ floppies)
cpio just quit at disk 74. After trying a second time and failing at
disk 71, I was getting pretty paranoid about losing irreplaceable data.
It turned out that disks 71-91 were Kodak HD600 (96 TPI.) I would have
thought that better disks would have no problem at lower track densities.
Is there something inherent in the magnetism of the (thinner?) magnetic
coating or the sensitivity of a 48 TPI drive head that makes the use of such
floppies a hazard to your data? Or did I just hit a few bad disks?
In any case, using fc, I could copy the data from the 96 TPI disks
(sometimes after many tries) to regular 48 TPI floppies. From then on
there was no problem.
4 - Make one cpio set for each directory that does not exist in the
distribution; in my case /usr/man, /usr/lbin, /usr/src, /usr/doc,
/usr/local, /usr/games.
5 - Login as root.
6 - Delete all /u files: rm -r /u (I felt really funny doing this...)
7 - Delete all the directories backed up in 4.
8 - Do a find / -newer /bin/cat -print > /tmp/modified.files. This will
make a list of all the remaining files that have been modified since
the installation of the foundation set.
9 - Print this file and go through it, deleting any files that you know are
in a package that is backed up on floppy. These files would still be
there because they were not removed in step 1, probably because they
came from a non-installable package.
10 - Make a separate cpio set for each directory remaining on that list. In
my case: /bin, /etc, /lib, /usr/bin, /usr/lib, /usr/mail, /usr/spool.
Mark these clearly to the effect that they will have to be reviewed
before restoring.
11 - Reboot from the floppy UNIX and install a clean foundation set. When asked if
you want to wipe out the files on the HD, say yes. (how often do you
get to willingly destroy everything on your HD? :-)
12 - Login as install and install the appropriate installable packages in
the appropriate order: i.e. Telephone, ATE, Curses/Terminfo end user
package, GSS Drivers, Dev. set, Enhanced editors, Encryption set (the
order of this one is important), etc.
13 - Login as root and restore the cpio sets made in step 4: cpio -iBdcv <
/dev/rfp021 for each of /usr/man, /usr/doc, /usr/local, etc... The
idea of restoring these before /u is that, since these files are
modified less than user files, they will stay packed and unfragmented
closer to the beginning of the disk longer. Is this reasoning correct?
14 - Make whatever links you had that were not standard: i.e.
ln /bin/as /bin/mas, ln /bin/cc /bin/mcc, ln /usr/bin/compress
/usr/bin/zcat, etc...
15 - cd /tmp
16 - One by one, restore the directories saved in step 10, REDIRECTING to
the current directory: cpio -iBdcvR < /dev/rfp021
17 - For each of the directories in step 10, do: diff -r <name-of-directory>
/<name-of-directory> > <name-of-directory>.diff. This will give you
a list of which files were present in your old directory and not in
the clean one (these you want to copy to the new), which files are in
the new and not in the old (ignore these), and which files are in both
but differ.
18 - For each of the directories in step 10, edit the file /tmp/<name-of-
directory>.diff. Delete the lines: "only in /<name-of-directory>".
Copy the files on the lines: "only in <name-of-directory>" to the new
one (/<name-of-directory>). For those that exist in both the old and the
new, you'll have to decide whether to copy them or not. Unless you know
what the file is for, and you're sure you want the old version, don't
copy it. It is better to have to do some minor configuration later on
than to have a still-corrupt system. In the case of /etc, the only files
I copied from the old directory (now /tmp/etc) were /etc/daemons/*,
/etc/group and /etc/passwd. It was in this step that I found out how
many corrupted or inconsistent files I really had.
19 - After finishing each of the directories in step 10, clean up /tmp. This
will reduce external fragmentation (I think.)
20 - Install those packages that were in /usr/src.
21 - Do an unconditional restore of /u: cpio -iBdcvu < /dev/rfp021. Before
doing this I saved /u as it was laid out by the foundation set in a
/tmp file and then applied step 17 to that file. This, however, is not
necessary, since no user files were modified during installation.
22 - Reboot the system, YOU'RE DONE !!!
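Steps 17 and 18 could be partly automated. The following is only a minimal
sketch, untested on the 3b1. It assumes a diff that accepts -q and prints
"Only in DIR: NAME" and "Files A and B differ" (the wording GNU diff uses;
other versions may phrase it differently). The function name classify_tree
and the .copy/.ignore/.review suffixes are made up for illustration.

```shell
#!/bin/sh
# classify_tree OLD NEW PREFIX   (hypothetical helper, not part of any package)
# Splits the output of "diff -rq OLD NEW" into three lists:
#   PREFIX.copy    entries present only under OLD (candidates to copy to NEW)
#   PREFIX.ignore  entries present only under NEW (leave these alone)
#   PREFIX.review  files present in both trees whose contents differ
# Assumption: GNU-style diff messages; LC_ALL=C keeps them in English.
classify_tree() {
    old=$1 new=$2 prefix=$3
    # Start with empty lists so all three files exist afterwards.
    : > "$prefix.copy"; : > "$prefix.ignore"; : > "$prefix.review"
    LC_ALL=C diff -rq "$old" "$new" | while IFS= read -r line; do
        case $line in
            "Only in $old: "*|"Only in $old/"*)
                printf '%s\n' "$line" >> "$prefix.copy" ;;
            "Only in $new: "*|"Only in $new/"*)
                printf '%s\n' "$line" >> "$prefix.ignore" ;;
            "Files "*" differ")
                printf '%s\n' "$line" >> "$prefix.review" ;;
        esac
    done
}
```

Run from /tmp after step 16, something like "classify_tree etc /etc etc"
would leave etc.copy, etc.ignore and etc.review in /tmp. Nothing is copied
automatically; the .review list still has to be judged file by file, as
described in step 18.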
One very minor problem is that links cannot be made across cpio sets. Cpio
could not recreate the link of one /usr/src file to a /u file since /u was
not on the same set.
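One way to spot such links before making the cpio sets would be to list
every file that has more than one link, together with its inode number, so
that names sharing an inode can be matched up and re-linked by hand after
the restore. A sketch, assuming a find that supports -links and an ls that
supports -i (both are common, but unverified on the 3b1); list_linked is a
made-up name.

```shell
#!/bin/sh
# list_linked DIR...   (hypothetical helper)
# Prints the inode number and path of every regular file under the given
# directories that has more than one hard link, sorted by inode number so
# that names sharing an inode end up on adjacent lines.
# Caveat: inode numbers are only comparable within one filesystem.
list_linked() {
    find "$@" -type f -links +1 -exec ls -i {} \; | sort -n
}
```

For example, "list_linked /usr/src /u > /tmp/links.before" run before step 2
would record the pairs that need an ln after step 21.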
The only re-configuration I had to do was to set up the printer, the phone
line and the screen blanking interval. This was done in 5 minutes and could
have been avoided had I restored the files where this information is kept.
Well, I hope this helps someone with a similar problem. Of course, if
somebody decides to automate this procedure by putting it into a script, I
would definitely like to see it. If someone has other ideas or comments on
how this process could be simplified, I would also like to hear them.
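As a small first piece of such a script, step 8 could be wrapped so that it
also summarizes which top-level directories contain modified files, which is
the list needed for step 10. This is a sketch under assumptions, not a tested
3b1 tool: modified_dirs is an invented name, and it relies on a find with
-newer plus awk and sort. Unlike step 8, it only counts regular files.

```shell
#!/bin/sh
# modified_dirs ROOT REFERENCE   (hypothetical helper)
# For each first-level directory under ROOT, prints "COUNT DIR", where COUNT
# is the number of regular files in it modified more recently than REFERENCE.
# Files sitting directly in ROOT are skipped. REFERENCE must be an absolute
# path, because the function changes directory before running find.
# In step 8, ROOT would be / and REFERENCE would be /bin/cat.
modified_dirs() {
    root=$1 ref=$2
    ( cd "$root" &&
      find . -type f -newer "$ref" -print |
      awk -F/ 'NF > 2 { n[$2]++ } END { for (d in n) print n[d], d }' |
      sort -k2 )
}
```

The full file list from step 8 is still needed for the review in step 9; this
only tells you how many cpio sets step 10 will take.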
Augustine Cano canoaf at dept.csci.unt.edu