Optimal for loop on the 68020.
Jef Poskanzer
pokey at well.UUCP
Mon Jun 5 10:18:11 AEST 1989
SUMMARY
-------
I compiled the following different kinds of for loops:
for ( i = 0; i < COUNT; i++ )
for ( i = 0; i < COUNT; ++i )
for ( i = 0; ++i <= COUNT; )
for ( i = 0; i++ < COUNT; )
for ( i = COUNT; i > 0; i-- )
for ( i = COUNT; i > 0; --i )
for ( i = COUNT; --i >= 0; )
for ( i = COUNT; i-- > 0; )
on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
generated code and the timings. COUNT was a small (< 127) compile-time
constant. The loop body did not reference i and had no side-effects.
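(For reference, here is a minimal reconstruction of the sort of test
program involved -- a sketch, since the original harness isn't shown:
the repetition count, the timing method, and the names are my
assumptions, and it is written in ANSI C for convenience. A compiler
smart enough to delete the whole loop would defeat it, but as the data
below shows, these compilers are not.)

    #include <stdio.h>
    #include <time.h>

    #define COUNT 100           /* small compile-time constant, < 127 */
    #define REPS  1000000L      /* repeat enough to swamp clock noise */

    int main(void)
    {
        int i;
        long r;
        clock_t t0 = clock();

        for (r = 0; r < REPS; ++r)
            for (i = 0; i < COUNT; i++)     /* the loop under test */
                ;                           /* empty body, i unused */

        printf("%.3f usecs per iteration\n",
               (clock() - t0) * 1e6 / CLOCKS_PER_SEC
               / ((double)REPS * COUNT));
        return 0;
    }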
In theory, all eight of these loops should compile down to the most
efficient loop possible, ignoring the otherwise unreferenced variable i
and simply traversing the loop body the proper number of times. On the
68020, this most efficient loop is a single dbra ("decrement and
branch") instruction, which decrements the low 16 bits of a data
register and branches back unless the result is -1. But in fact, cc
never generates dbra, and gcc generates it for only one of the above
loop types, and only when the -fstrength-reduce flag is specified.
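In C terms, dbra implements exactly this shape (a sketch; its 16-bit
count limitation is glossed over here and shows up again in case 4a):

    /* One dbra per trip buys you the decrement, the test against -1,
       and the branch in a single instruction.  For i >= 0 this has the
       same trip count as the "--i >= 0" loop in case 4a. */
    for ( i = COUNT; --i != -1; )
        ;   /* <loop body> */

which is a strong hint about which of the eight loops has the best
chance of mapping onto it.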
In contrast, the BSD cc on a vax generates a single instruction loop in
four out of the eight cases, including the two most commonly-used
ones. This is not because the vax compiler is any smarter -- the fact
that it wins only half the time, instead of all the time, shows that it
is also failing to recognize that it can freely optimize the loop. The
only reason the vax's cc does better is that the vax has a richer
instruction set.
Since the common for loops carry 1.7 times the loop overhead of a dbra
loop (0.97 vs. 0.57 usecs below), it is useful to know how to provoke
your compiler into generating one. But it would be more useful for
compilers to be smarter.
And more generally, those of you who blithely assume (as I used to)
that compilers are smart enough to turn clearly-expressed C into
optimal assembler should perhaps look at their assembler output more
often -- both cc and gcc will write it out for you with the -S flag.
FULL DATA
---------
Case 1) First, let's look at the generic loops, counting up from zero
to the limit:
for ( i = 0; i < COUNT; i++ )
for ( i = 0; i < COUNT; ++i )
cc -O                           gcc -O & gcc -O -fstrength-reduce

    moveq #0,d6                     clrl d0
tag:                            tag:
    <loop body>                     <loop body>
    addql #1,d6                     addql #1,d0
    moveq #COUNT,d1                 moveq #COUNT-1,d3
    cmpl d1,d6                      cmpl d0,d3
    jlt tag                         jge tag

0.97 usecs                      0.97 usecs
The two compilers generate essentially the same instructions. Pre- or
post-increment makes no difference. Note that the moveq of COUNT could
have been moved above the loop -- the loop body doesn't call any
subroutines that might smash the register. This looks like a missed
optimization.
Case 2a) Second, let's try doing the increment and test in one C
statement. Seems to me this shouldn't make any difference to a smart
optimizer, but...
for ( i = 0; ++i <= COUNT; )
cc -O                           gcc -O & gcc -O -fstrength-reduce

    moveq #0,d6                     clrl d0
    jra tag2                        jra tag2
tag1:                           tag1:
    <loop body>                     <loop body>
tag2:                           tag2:
    addql #1,d6                     addql #1,d0
    moveq #COUNT,d1                 moveq #COUNT,d3
    cmpl d1,d6                      cmpl d0,d3
    jle tag1                        jge tag1

0.97 usecs                      0.97 usecs
No important difference from case 1.
Case 2b) But if we try the post-increment version:
for ( i = 0; i++ < COUNT; )
cc -O                           gcc -O & gcc -O -fstrength-reduce

    moveq #0,d6                     clrl d0
    jra tag2                        jra tag2
tag1:                           tag1:
    <loop body>                     <loop body>
tag2:                           tag2:
    movl d6,d0                      addql #1,d0
    addql #1,d6                     moveq #COUNT+1,d3
    moveq #COUNT,d1                 cmpl d0,d3
    cmpl d1,d0                      jgt tag1
    jlt tag1

1.11 usecs                      0.97 usecs
Gcc is smart enough to bias the loop count, but cc doesn't see that
optimization and so adds an extra instruction.
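At the source level, the bias gcc applied amounts to this rewrite (a
sketch of the idea, not gcc's actual internal transformation):

    for ( i = 0; i++ < COUNT; )         /* test needs the OLD value of i */
        ;
    /* ...becomes, with the limit biased by one: */
    for ( i = 0; ++i < COUNT + 1; )     /* same trip count, same final i,
                                           and no copy of the old value */
        ;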
Case 3) Third, let's try a count-down loop:
for ( i = COUNT; i > 0; i-- )
for ( i = COUNT; i > 0; --i )
cc -O                           gcc -O & gcc -O -fstrength-reduce

    moveq #COUNT,d6                 moveq #COUNT,d0
tag:                            tag:
    <loop body>                     <loop body>
    subql #1,d6                     subql #1,d0
    tstl d6                         tstl d0
    jgt tag                         jgt tag

0.83 usecs                      0.83 usecs
Here the two compilers generate exactly the same instructions. There
is one less instruction than in the count-up case, because the
compilers are smart enough to use the tstl instruction to compare
against zero, and so they don't need the moveq #COUNT instruction
(which shouldn't have been in the loop in the first place).
Case 4a) Fourth, let's try a count-down loop with the decrement
included in the test statement:
for ( i = COUNT; --i >= 0; )
cc -O                   gcc -O                  gcc -O -fstrength-reduce

    moveq #COUNT,d6         moveq #COUNT,d0         moveq #COUNT,d0
    jra tag2                jra tag2                jra tag2
tag1:                   tag1:                   tag1:
    <loop body>             <loop body>             <loop body>
tag2:                   tag2:                   tag2:
    subql #1,d6             subql #1,d0             dbra d0,tag1
    jge tag1                jpl tag1                clrw d0
                                                    subql #1,d0
                                                    jcc tag1

0.70 usecs              0.70 usecs              0.57 usecs
Cc and gcc generate code similar to case 3, except that they suddenly
decide that subql sets the condition codes just fine by itself, and a
separate tstl is not necessary. So they shave off an instruction.
Seems like a peephole optimizer should have picked this up in the
previous cases too; it's another missed optimization.
But the big win here is with the -fstrength-reduce option in gcc. It
actually generates the optimal one-instruction loop, for a factor of
1.7 reduction in loop overhead from the generic loop. Not bad. But
wait! What's that chud after the loop? Let's see, clear the low word
of d0, subtract one from the whole register, and jump back if the
subtract didn't borrow. That's the standard fixup for dbra's 16-bit
count: dbra decrements only the low word of the register, so for
counts of 65536 or more this sequence borrows the decrement out of the
high word and re-enters the loop. With our small COUNT the high word
is zero, the subtract borrows, and all three instructions amount to a
no-op. Oh well, since it's not inside the loop it doesn't make much
difference.
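For the curious, here is the whole -fstrength-reduce loop, fixup
included, modeled in C (a sketch; d0 stands in for the data register
and COUNT for the constant):

    #define COUNT 100

    void repeat_count_times(void)
    {
        unsigned long d0 = COUNT;

        goto tag2;
    tag1:
        ;                               /* <loop body> */
    tag2:
        /* dbra d0,tag1: decrement only the LOW WORD of d0 and branch
           back unless that word just became -1 (0xFFFF). */
        d0 = (d0 & 0xFFFF0000UL) | ((d0 - 1) & 0xFFFFUL);
        if ((d0 & 0xFFFFUL) != 0xFFFFUL)
            goto tag1;
        /* the fixup: clrw d0 / subql #1,d0 / jcc tag1.  Borrow the
           decrement out of the high word and, if any high-word counts
           remain, go around for another 65536 iterations.  With
           COUNT < 65536 the high word is zero and we fall through. */
        d0 &= 0xFFFF0000UL;             /* clrw d0 */
        if (d0 != 0) {                  /* jcc: subql didn't borrow */
            d0 -= 1;
            goto tag1;
        }
    }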
Case 4b) But if we try this with post-decrement:
for ( i = COUNT; i-- > 0; )
cc -O                           gcc -O & gcc -O -fstrength-reduce

    moveq #COUNT,d6                 moveq #COUNT,d0
    jra tag2                        jra tag2
tag1:                           tag1:
    <loop body>                     <loop body>
tag2:                           tag2:
    movl d6,d0                      subql #1,d0
    subql #1,d6                     moveq #-1,d3
    tstl d0                         cmpl d0,d3
    jgt tag1                        jlt tag1

0.97 usecs                      0.97 usecs
Cc not only adds a movl to save the unincremented value of i, it then
has to test that saved copy with an explicit tstl -- it misses the
fact that "i-- > 0" is the very same test as the "--i >= 0" it
compiled so nicely in case 4a. And gcc gets totally confused and
brings in a literal -1 to compare against, inside the loop no less.
CONCLUSION
----------
For both compilers and all levels of optimization, this loop:
for ( i = COUNT; --i >= 0; )
gives the lowest overhead. Under gcc -fstrength-reduce that means the
optimal single-instruction dbra loop; but even under cc, and under gcc
without -fstrength-reduce, this loop is faster than the others. I
don't show the results here, but the same holds for large constant
loops and for variable-length loops.
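If you do adopt the idiom, one way to hide its odd shape is a wrapper
macro (hypothetical -- my name, not from the post). One caution either
way: i must be a signed type, since with an unsigned i the test
--i >= 0 is always true and the loop never terminates.

    #include <stdio.h>

    /* Hypothetical convenience wrapper for the fastest idiom measured
       above.  The index must be SIGNED for the test to ever fail. */
    #define REPEAT(i, n)  for ( (i) = (n); --(i) >= 0; )

    int main(void)
    {
        int i;

        REPEAT(i, 100)
            putchar('*');       /* body runs exactly 100 times */
        putchar('\n');
        return 0;
    }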
It is unfortunate that we have to use such a non-intuitive loop to get
the best results -- especially unfortunate since it is probably *not*
the best loop on some other architectures or compilers -- but that's
the facts.
I would be interested to see similar checks done on different
architectures; in particular RISC machines, which supposedly are
designed in cooperation with the compiler writers.
---
Jef
Jef Poskanzer {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
"Man invented language to satisfy his deep need to complain." -- Lily Tomlin