Explanation, please!
Henry Spencer
henry at utzoo.uucp
Mon Aug 29 10:21:01 AEST 1988
In article <dpmuY#2EBC4R=eric at snark.UUCP> eric at snark.UUCP (Eric S. Raymond) writes:
>This only makes sense if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.
Nope. Bell Labs Research uses VAXen and 68Ks, mostly.
The key point is not pipelining, but loop-control overhead. There is in
fact a tradeoff here: unrolling the loop further will reduce control
overhead further, but will increase code size. That last is of some
significance when caching gets into the act: cache-loading overhead
favors short loops, and small cache sizes very strongly favor short ones.
In general there is an optimal point in there somewhere, and an unrolling
factor of 8 or 16 is a pretty good guess at it on the machines I've looked
at closely.
--
Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology
they just act that way. | uunet!attcan!utzoo!henry henry at zoo.toronto.edu