Unrolling string copy loops

Thu Apr 4 23:43:29 AEST 1985

>>>>
Having noticed a discussion of the benefits of loop unrolling for
string copy (and other functions), I thought I'd share a similar
experience here, as it gave us BIG gains.

The Sperry 1100 mainframe, on which a version of the UNIX(tm) system
has been running since 1979, is a WORD ADDRESSABLE machine (and the
words are 36-bit ones' complement).  Needless to say, implementing a
C compiler is somewhat interesting, especially in the area of char
pointer dereferencing.  At run time, the 20-bit pseudo-byte pointer
is split into its word and "byte" components, and then the proper
partial word is loaded from memory.  This multi-instruction sequence
is much more expensive than a char dereference on your usual
byte-addressable machine.

Enter loop unrolling.  Our large project (over 1 million lines of C
code) was profiled and found to spend a lot of time in the str*()
functions.  Since the str*() functions process their arguments
sequentially (char 0, then 1, ..., then n), you can determine the
starting partial word (1st 9 bits, 2nd, 3rd, or 4th) once and then
predict which 9 bits you will need next (2nd, 3rd, 4th, or 1st of
the next word).  For strcpy, you create a 4 by 4 table of entry
points, one for each combination of source and destination starting
position, and away you go.
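
To make the 4 by 4 table concrete, here is a rough sketch in portable
C.  It is NOT the Sperry library code: it only simulates the idea.
The word36 type, the quarter layout (quarter 0 in the high-order
bits), the GET/PUT macros and all function names are assumptions
made up for illustration.  A pseudo-byte pointer is modelled as
(word index * 4 + quarter number).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the hardware: a 36-bit word kept in the low bits
   of a uint64_t, holding four 9-bit "bytes", quarter 0 in the high bits. */
typedef uint64_t word36;

#define SHIFT(q)     (27 - 9 * (q))
#define GET(w, q)    ((unsigned)(((w) >> SHIFT(q)) & 0x1FF))
#define PUT(w, q, c) ((w) = ((w) & ~((word36)0x1FF << SHIFT(q))) \
                          | ((word36)(c) << SHIFT(q)))

/* Baseline: split the pseudo-byte pointer on every single character. */
void sim_strcpy_naive(word36 *mem, size_t dst, size_t src)
{
    unsigned c;
    do {
        c = GET(mem[src >> 2], src & 3);
        PUT(mem[dst >> 2], dst & 3, c);
        src++;
        dst++;
    } while (c != 0);
}

/* One copy step whose quarter numbers are compile-time constants, so each
   step can become a fixed partial-word load and store. */
#define STEP(sq, dq)                    \
    do {                                \
        unsigned c = GET(*sp, sq);      \
        PUT(*dp, dq, c);                \
        if (c == 0) return;             \
        if ((sq) == 3) sp++;            \
        if ((dq) == 3) dp++;            \
    } while (0)

/* One entry point per (source quarter, destination quarter) pair. */
#define DEFINE_COPY(s, d)                                   \
    static void copy_##s##d(word36 *dp, const word36 *sp)   \
    {                                                       \
        for (;;) {                                          \
            STEP(((s) + 0) & 3, ((d) + 0) & 3);             \
            STEP(((s) + 1) & 3, ((d) + 1) & 3);             \
            STEP(((s) + 2) & 3, ((d) + 2) & 3);             \
            STEP(((s) + 3) & 3, ((d) + 3) & 3);             \
        }                                                   \
    }

DEFINE_COPY(0, 0) DEFINE_COPY(0, 1) DEFINE_COPY(0, 2) DEFINE_COPY(0, 3)
DEFINE_COPY(1, 0) DEFINE_COPY(1, 1) DEFINE_COPY(1, 2) DEFINE_COPY(1, 3)
DEFINE_COPY(2, 0) DEFINE_COPY(2, 1) DEFINE_COPY(2, 2) DEFINE_COPY(2, 3)
DEFINE_COPY(3, 0) DEFINE_COPY(3, 1) DEFINE_COPY(3, 2) DEFINE_COPY(3, 3)

typedef void (*copy_fn)(word36 *, const word36 *);

/* The 4 by 4 table, indexed [source quarter][destination quarter]. */
static const copy_fn copy_table[4][4] = {
    { copy_00, copy_01, copy_02, copy_03 },
    { copy_10, copy_11, copy_12, copy_13 },
    { copy_20, copy_21, copy_22, copy_23 },
    { copy_30, copy_31, copy_32, copy_33 },
};

/* Split both pointers once, then jump to the matching entry point. */
void sim_strcpy_unrolled(word36 *mem, size_t dst, size_t src)
{
    copy_table[src & 3][dst & 3](mem + (dst >> 2), mem + (src >> 2));
}
```

The pointer split happens once per call in sim_strcpy_unrolled()
instead of once per character as in sim_strcpy_naive().  Source and
destination strings must not overlap.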

Moral of the story - this technique cut the cpu cost of the str*()
functions by 90% (they had been quite expensive), and they were never
seen on our cpu profiles again.  Loop unrolling will work on other,
normal machines too: you process *cp, *(cp+1), *(cp+2), etc. at the
cost of a few extra words of memory, because you're duplicating the
load/store sequence with different offsets from your original cp
pointer, which you put in a register beforehand.
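
For a byte-addressable machine, the same idea might look like the
sketch below.  The name strcpy_unrolled4 and the unroll factor of 4
are my own choices for illustration, not anything from a real
library.  Whether this beats your compiler's own output or your
library's strcpy is something you would have to measure.

```c
#include <stddef.h>

/* Hypothetical helper: copies src to dst, four characters per loop trip.
   Each step uses a fixed offset from the base pointers, and the pointers
   themselves are bumped only once per trip. */
char *strcpy_unrolled4(char *dst, const char *src)
{
    char *d = dst;
    const char *s = src;

    for (;;) {
        if ((d[0] = s[0]) == '\0') break;
        if ((d[1] = s[1]) == '\0') break;
        if ((d[2] = s[2]) == '\0') break;
        if ((d[3] = s[3]) == '\0') break;
        d += 4;
        s += 4;
    }
    return dst;
}
```

The terminator is tested after every character, so the loop never
reads past the end of the source string.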