Magic Numbers (and incredible stupidity in "cpio")

Guy Harris guy at sun.uucp
Sat Dec 7 16:03:45 AEST 1985


> Executables using ``standard'' binary formats, i.e. a.out (PDP-11, Z8000)
> and b.out (MC68000) use the standard magic numbers 0405, 0407, 0410, 0411.
> Non-standard formats, like Xenix x.out (0x0206) and COFF (flames to
> /dev/null; most systems are [ab].out) use distinctive magic numbers.

Well, VAX UNIX (32V, 4.xBSD, System III, Version 8?) also uses those magic
numbers (with 0413 added for demand-paged executables on 4.xBSD), as
probably do lots of other 4.xBSD systems (Sun's does).  Does "most" mean "most
UNIX implementations" or "most boxes running UNIX"?  If the latter, I think
Xenix is running on a lot of systems, possibly most.  Then again, *my* copy
of "Xenix(TM) Standard Object File Format (January 1983)" implies that
"0x0206" is the "magic number" and is *not* distinctive; the "x_cpu" field
indicates what CPU it's intended for.  (This is sort of like the new Sun
UNIX 3.0 object file format, where the "a_machtype" field indicates whether
it's intended for a 68010 or 68020).

COFF seems to invert this, since the "file header" indicates what machine
it's intended for (and tons of other glop) and the "UNIX header" (which is
basically the old a.out header) has the 0405, 0407, 0410, 0411, and 0413
(yes, that's what they use for paged executables, surprise surprise) which
indicates the format of the image but is machine-independent (modulo byte
ordering).  Then again, the "file header" magic number seems to indicate
something about the format of the executable, but see a previous posting of
mine for some dyspepsia caused by the proliferation of multiple file header
magic numbers.
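As a rough illustration of the a.out-style magic numbers discussed above, here
is a minimal C sketch that maps them to descriptions.  Only 0413 (demand
paged) is described in this posting; the descriptions for 0405, 0407, 0410
and 0411 are the traditional a.out meanings and should be read as
assumptions, as should the function names, which are made up for the example.
The "big_endian" flag reflects the "modulo byte ordering" caveat.

```c
#include <stddef.h>

/* Hypothetical helper: describe a classic a.out-style magic number.
 * Returns NULL for anything else (e.g. the Xenix x.out value 0x0206). */
const char *aout_magic_name(unsigned magic)
{
    switch (magic) {
    case 0405: return "overlay";              /* assumed traditional meaning */
    case 0407: return "normal (impure)";      /* assumed traditional meaning */
    case 0410: return "read-only text";       /* assumed traditional meaning */
    case 0411: return "separate I&D";         /* assumed traditional meaning */
    case 0413: return "demand paged";         /* per the posting */
    default:   return NULL;
    }
}

/* Assemble a 16-bit magic number from the first two bytes of a file,
 * given the byte order of the machine that wrote it. */
unsigned aout_magic_from_bytes(const unsigned char b[2], int big_endian)
{
    return big_endian ? ((unsigned)b[0] << 8) | b[1]
                      : ((unsigned)b[1] << 8) | b[0];
}
```

A caller would read the first two bytes of the file, pick the byte order, and
pass the result to aout_magic_name().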

> There are other magic numbers.  Old-style archives (ar) have 0177545 as a
> magic number; again, the loader knows about this, since a library is an
> archive.  System V archives begin with the magic ``number'' "!<arch>\n".

System V Release 2 archives, anyway.  System V Release 1 had its own
portable archive format, different from the 4.xBSD one; the 4.xBSD format
was the first to use the "!<arch>\n" magic "number".  I'm told they came to
their senses because Version 8, being 4.1BSD-based, used that format.
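A minimal C sketch of telling the two archive flavours apart by their leading
bytes.  It assumes the old-style magic 0177545 is stored as a 16-bit word and
accepts either byte order; the enum and function names are invented for the
example.

```c
#include <stddef.h>
#include <string.h>

enum ar_kind { AR_UNKNOWN, AR_OLD, AR_PORTABLE };

/* Classify an archive from the first bytes of the file. */
enum ar_kind ar_classify(const unsigned char *buf, size_t len)
{
    /* Portable format: an 8-character printable string, no byte order. */
    if (len >= 8 && memcmp(buf, "!<arch>\n", 8) == 0)
        return AR_PORTABLE;

    /* Old format: binary 0177545, assumed here to be a 16-bit word. */
    if (len >= 2) {
        unsigned lo_first = (unsigned)buf[0] | ((unsigned)buf[1] << 8);
        unsigned hi_first = ((unsigned)buf[0] << 8) | (unsigned)buf[1];
        if (lo_first == 0177545 || hi_first == 0177545)
            return AR_OLD;
    }
    return AR_UNKNOWN;
}
```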

> Cpio archives also have magic numbers in them, but at the archive-member
> level.

No, it has a magic number at the beginning - 070707 (either as a "short" or
a string, depending on whether it's an old cruddy "cpio" archive or a nice
new "gee, we've finally caught up with 'tar' when it comes to portability"
"cpio -c" archive.  (S3 had "-c", but it had a bug so it wasn't really
portable.  S5 fixed this bug.  S5 also broke the byte-swapping garbage:

	S3 had an option to swap the bytes within 2-byte quantities.
	Presumably, this was because running the tape through "dd" to
	byte-swap *everything*, and then byte-swapping the data and
	pathnames inside "cpio", thus swapping the binary portion of the
	header once and everything else twice, is obviously more efficient
	than just swapping the binary portion of the header once.  ("cpio"
	already has hacks to deal with 4-byte quantities - namely,
	file size and modified time - automagically, by shoving "1L" into
	a "long" and seeing whether the 0th byte of that "long" is 0 or
	not, so PDP-11s and VAXes don't have problems.)  It is also
	obvious that forcing the user to specify a byte-swapping option
	is better than just looking at the magic number and seeing whether
	it's 070707 or a byte-swapped 070707 and deciding whether to
	swap or not based on that.

	Whoever worked on "cpio" for S5 obviously figured that the
	purpose of this byte-swapping crap was to make it possible to
	move binary data between machines with different byte orders
	(as everybody knows, most files with binary data are continuous
	streams of 2-byte or 4-byte quantities), not to provide a gross
	and kludgy way of byte-swapping the binary portion of a "cpio"
	header, so they added an option to swap the 2-byte portions
	of 4-byte quantities ("stupid FP-11", to quote - if I remember
	correctly - the VAX System III linker, that particular piece of
	DEC hardware being responsible for some PDP-11 software, including
	but *NOT* limited to UNIX, having a different format for 32-bit
	integers than the VAX's hardware supports) and an option to
	swap both bytes and 2-byte quantities.  They also "fixed" it
	not to swap the bytes of the pathnames.  This "fix" means that
	running the "cpio" archive through "dd" to swap the bytes, and
	then doing a byte swap again in "cpio", results in path names
	with their bytes swapped!  ("/nuxi", anyone?)  In effect, you
	are now screwed if you have a "cpio" tape, not made with "-c",
	which was produced on a machine with a different byte order.
	You can't read it in conveniently.  (This has been experimentally
	verified.  I had to whip up a version of "cpio" which does what
	"cpio" should have done in the first place - namely, just byte
	swap the damn "short"s in the header - to read a tape made on
	a System V VAX using the System V "cpio" on a Sun.))
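The approach argued for above (look at the magic number, then byte-swap only
the "short"s in the header) can be sketched in C roughly as follows.  This is
not the code of any real "cpio"; the names are hypothetical, and the sketch
assumes the old binary header is 13 16-bit words (magic, dev, ino, mode, uid,
gid, nlink, rdev, mtime[2], namesize, filesize[2]) with the 4-byte quantities
stored high-order half first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CPIO_MAGIC         070707
#define CPIO_BIN_HDR_WORDS 13      /* assumed old binary header size */

enum cpio_kind { CPIO_UNKNOWN, CPIO_BIN_NATIVE, CPIO_BIN_SWAPPED, CPIO_ASCII };

static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Decide the archive flavour, and whether swapping is needed,
 * purely from the leading magic number. */
enum cpio_kind cpio_classify(const unsigned char *buf, size_t len)
{
    uint16_t magic;

    if (len >= 6 && memcmp(buf, "070707", 6) == 0)
        return CPIO_ASCII;               /* "cpio -c" style header */
    if (len < 2)
        return CPIO_UNKNOWN;
    memcpy(&magic, buf, 2);              /* read as a host-order short */
    if (magic == CPIO_MAGIC)
        return CPIO_BIN_NATIVE;
    if (swap16(magic) == CPIO_MAGIC)
        return CPIO_BIN_SWAPPED;
    return CPIO_UNKNOWN;
}

/* Swap the bytes of each header word; pathname and file data are
 * never touched, so no "/nuxi" effect. */
void cpio_swap_header(uint16_t hdr[CPIO_BIN_HDR_WORDS])
{
    size_t i;
    for (i = 0; i < CPIO_BIN_HDR_WORDS; i++)
        hdr[i] = swap16(hdr[i]);
}

/* Combine a 4-byte quantity stored as two shorts, high half first
 * (assumed layout), independent of the host's "long" format. */
uint32_t cpio_long(const uint16_t w[2])
{
    return ((uint32_t)w[0] << 16) | w[1];
}
```

A reader would call cpio_classify() on the first header of the archive and, if
it reports CPIO_BIN_SWAPPED, run cpio_swap_header() on every header it reads
afterwards, with no user-supplied option involved.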

There are a number of quite intelligent and talented people working on UNIX
development at AT&T Information Systems.  It looks like the people in charge
of keeping track of COFF magic numbers, and in charge of "cpio", are in need
of some supervision by the aforementioned people.  (Fortunately, it looks
like the IEEE P1003 committee is looking at a "tar"-based format, with fixes
to support storing information about directories and special files, for
tapes.  I'm told that the European UNIX vendor consortium, X/OPEN, chose a
"cpio" format because of the "cpio" *program*'s byte-swapping
"capabilities".  Aside from the basic stupidity (and incorrectness, in the
case of the S5 "cpio") of these "capabilities", they are irrelevant to the
choice of tape *format* because:

	1) "tar" doesn't need byte-swapping options because the
	   control information is in printable ASCII string format
	   (any tape controller which is good as anything other than
	   a target for skeet-shooting will write character strings
	   in memory out to the tape in character-string order)

	2) "cpio" has the "-c" option which does the same thing,
	   so it doesn't need those options except for reading old
	   tapes (any reasonable "cpio"-format-based standard would
	   be based on "cpio -c" format, not "cpio" format),

and
	3) a *good* program which handles "cpio" format can figure
	   out the byte order it needs for reading pre-"cpio -c"
	   tapes by looking at the magic number anyway!)
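To illustrate points 1 and 2, here is a minimal C sketch of reading a
character-format header, where every control field is printable octal and so
needs no byte swapping at all.  The field widths used (six octal digits each,
eleven for the modification time and file size, 76 characters in total after
the "070707" magic is counted) are the layout commonly documented for
"cpio -c" and are an assumption here; the struct and function names are
invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define CPIO_C_HDR_LEN 76   /* assumed: 6 + 8*6 + 2*11 characters */

struct cpio_c_hdr {
    unsigned long long dev, ino, mode, uid, gid, nlink, rdev,
                       mtime, namesize, filesize;
};

/* Parse exactly n octal digits; return 0 on success, -1 on a bad digit. */
static int parse_octal(const char *s, size_t n, unsigned long long *out)
{
    unsigned long long v = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (s[i] < '0' || s[i] > '7')
            return -1;
        v = (v << 3) | (unsigned long long)(s[i] - '0');
    }
    *out = v;
    return 0;
}

/* Parse a character-format header; no byte-order handling is needed. */
int cpio_c_parse(const char *buf, size_t len, struct cpio_c_hdr *h)
{
    static const size_t width[10] = { 6, 6, 6, 6, 6, 6, 6, 11, 6, 11 };
    unsigned long long *field[10];
    const char *p;
    size_t i;

    field[0] = &h->dev;      field[1] = &h->ino;   field[2] = &h->mode;
    field[3] = &h->uid;      field[4] = &h->gid;   field[5] = &h->nlink;
    field[6] = &h->rdev;     field[7] = &h->mtime;
    field[8] = &h->namesize; field[9] = &h->filesize;

    if (len < CPIO_C_HDR_LEN || memcmp(buf, "070707", 6) != 0)
        return -1;
    p = buf + 6;
    for (i = 0; i < 10; i++) {
        if (parse_octal(p, width[i], field[i]) != 0)
            return -1;
        p += width[i];
    }
    return 0;
}
```

The pathname (namesize bytes) and the file data follow the header as plain
bytes, so the same code works unchanged on either byte order.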

(Flame off, until next time a collection of stupidities this gross comes to
light.)

	Guy Harris


