Optimal for loop on the 68020.

Sun Jun 11 06:45:57 AEST 1989

In article <11993 at well.UUCP>, pokey at well.UUCP (Jef Poskanzer) writes:
> SUMMARY
> -------
> 
> I compiled the following different kinds of for loops:
> 
>     for ( i = 0; i < COUNT; i++ )
>     for ( i = 0; i < COUNT; ++i )
>     for ( i = 0; ++i <= COUNT; )
>     for ( i = 0; i++ < COUNT; )
>     for ( i = COUNT; i > 0; i-- )
>     for ( i = COUNT; i > 0; --i )
>     for ( i = COUNT; --i >= 0; )
>     for ( i = COUNT; i-- > 0; )
> 
> on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
> generated code and the timings.  COUNT was a small (< 127) compile-time
> constant.  The loop body did not reference i and had no side-effects.
> In theory, all eight of these loops should have optimized to the most
> efficient loop possible, ignoring the otherwise unreferenced variable i
> and simply traversing the loop body the proper number of times.  On the
> 68020, this most efficient loop is a dbra instruction.  But in fact, cc
> never generates dbra
> 
	<details omitted>
> 
> CONCLUSION
> ----------
> 
> For both compilers and all levels of optimization, this loop:
> 
>     for ( i = COUNT; --i >= 0; )
> 
> gives the lowest overhead.

I tried these eight loops on Microsoft C v5.1 and was very surprised to
get the same results, more or less.  Jef's fastest loop (--i >= 0) on his
Sun was also MSC's fastest loop on an 80386.  And for the same reason: the
compiler wakes up an realizes that the condition flags are already set.
On 80x86, JCXZ is the fast loop instruction, but it is ignored by
the MSC optimizer as well.  (To be fair, I didn't turn on any
special optimization, so I can't say MSC won't
do it under some conditions.)  Making i a register variable would obviously
improve the timings, but MSC uses the SI register, not CX, and therefore
can't take advantage of JCXZ.  The code MSC generates for (--i >= 0) is
approx. 47% of the t-states used by the code it generates for the typical
(i=0; i<COUNT; i++) loop, but the typical case was not the worst case.

What an eye-opener!  I always assumed that it wasn't worth it being fussy
about such things in C, that the compiler would optimize it perfectly.
Like Jef says, if you want a particular part of your code to run fast,
don't assume it will be optimized; study the compiler's output.

Random thought: What are our "optimizing compilers" optimizing if they 
leave loops alone?

/alastair/