Explanation, please!
Henry Spencer
henry at utzoo.uucp
Mon Aug 29 10:21:01 AEST 1988
In article <dpmuY#2EBC4R=eric at snark.UUCP> eric at snark.UUCP (Eric S. Raymond) writes:
>This only makes sense if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.
Nope. Bell Labs Research uses VAXen and 68Ks, mostly.
The key point is not pipelining, but loop-control overhead. There is in
fact a tradeoff here: unrolling the loop further will reduce control
overhead further, but will increase code size. That last is of some
significance when caching gets into the act: cache-loading overhead
favors short loops, and small cache sizes very strongly favor short ones.
In general there is an optimal point in there somewhere, and an unrolling
factor of 8 or 16 is a pretty good guess at it on the machines I've looked
at closely.
--
Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology
they just act that way. | uunet!attcan!utzoo!henry henry at zoo.toronto.edu