micro-optimizing loops (was Help with casts)
Chris Torek
torek at elf.ee.lbl.gov
Sun Feb 24 08:37:25 AEST 1991
[the original loop:
for (i = 0; i < 100; i++)
x += i;
]
In article <409 at ceco.ceco.com> garry at ceco.ceco.com (Garry Garrett)
suggests rewriting the loop as
>>> for (i=99; i; i--)
>>> x += i;
>In article <339 at smds.UUCP> rh at smds.UUCP (Richard Harter) suggests instead:
>> for (i=100;--i>=0;)
>> x += i;
In article <414 at ceco.ceco.com> garry at ceco.ceco.com (Garry Garrett) writes:
> I must disagree with you. For starters, consider the situation where
>i == 0. Your loop adds 0 to x.
(True; we could use
for (i = 100; --i > 0;)
x += i;
instead, or equivalently `--i != 0', etc.)
>Secondly, as my discussion that you deleted pointed out, performing
>"--i>=0" would take more time than "i"
`No it won't! Yes it will! No it won't! Yes it will!' :-)
>The reason is that both i and 0 must be loaded into registers,
>compared, and a branch instruction executed for your version. My version
>loads i into a register and performs a branch.
This turns out not to be the case.
There are a number of common `dumb' implementations for the two loops
for (i = 100; --i > 0;) /* loop 1 */
x += i;
and
for (i = 99; i; i--) /* loop 2 */
x += i;
and they all depend on the compiler's and machine's characteristics.
(Note: for the discussion below I have assumed that `i' goes in a
machine register. If not, the picture becomes even more complicated,
as the loops then depend on which ALU operations, if any, are permitted
on memory operands.)
Most currently-popular machines can be placed in exactly one of these
two categories: `has condition codes' or `does not have condition
codes'. The former can be further divided into `has condition codes
that are or can be set on arithmetic operations such as decrement' and
`has condition codes that can only be set by test instructions'. The
latter can be divided into `has branch on positive instruction' and
`requires comparison against register'.
(I am deliberately leaving modern RISCs out of the first part of this
article.)
On the condition-code machines, the first loop tends to compile to one of:
# cc machine, compiler leaves test at top
mov $100, r0 # i = 100
loop:
dec r0 # decrement and set condition codes
jle end # exit loop if i <= 0
<code> # x += i, or whatever
jmp loop # continue loop
end:
[3 instructions overhead]
or:
mov $100, r0
loop:
dec r0 # decrement but do not set cc's
tst r0 # set cc's
jle end
<code>
jmp loop
end:
[4 instructions overhead]
or:
# cc machine, compiler copies test to bottom
mov $100, r0
dec r0
# optional: tst r0
jle end
loop:
<code>
dec r0
# optional: tst r0
jgt loop
end:
[2 or 3 instructions overhead]
Thus, the loop itself contains either three or four `overhead' instructions
if the compiler does not copy the test to the bottom, but only two or three
if it does. If the compiler is smarter, it also notices that i=100,--i
gives i=99, which is > 0, which means the top `jle' is unnecessary and
the `mov $100, r0; dec r0' sequence can be simplified to `mov $99, r0'.
# non-cc, branch on 0-compare; test left at top
mov $100, r0
loop:
dec r0
jle r0, end # jump if r0 <= 0
<code>
jmp loop
end:
[3 instructions overhead]
or:
# non-cc, branch on reg-compare; test left at top
mov $100, r0
mov $0, r1
loop:
dec r0
jle r0, r1, end # jump if r0 <= r1
<code>
jmp loop
end:
[3 instructions overhead]
or (if the compiler is seriously stupid):
mov $100, r0
loop:
dec r0
mov $0, r1
jle r0, r1, end
<code>
jmp loop
end:
[4 instructions overhead]
or:
# non-cc, branch on 0 compare; test copied to bottom:
mov $100, r0
dec r0
jle r0, end
loop:
<code>
dec r0
jgt r0, loop
end:
[2 instructions overhead]
or the same with the `mov $0, r1' added and the branches changed to
`jxx r0, r1, end|loop' [3 instructions overhead].
Thus, the compiler tends to emit 2, 3, or 4 instructions of loop
overhead, but you only get 4 if the compiler does not copy the
test to the bottom of the loop.
For the second loop [for (i = 99; i; i--)], we get one of these:
# cc machine, does not matter whether arithmetic sets cc's;
# compiler leaves test at top
mov $99, r0
loop:
tst r0 # set condition codes
jeq end # exit loop if i==0
<code>
dec r0
jmp loop
end:
[4 instructions of overhead]
or:
# cc machine; compiler copies test to bottom
mov $99, r0
tst r0
jeq end
loop:
<code>
dec r0
# optional tst r0
jne loop # repeat if not 0
end:
[2 or 3 instructions of overhead]
On the non-cc machine, we get one of:
# non-cc, test left at top
mov $99, r0
# optional mov $0, r1
loop:
jeq r0, end # or jeq r0, r1, end
<code>
dec r0
jmp loop
end:
[3 instructions overhead]
or:
# non-cc, test left at top, really dumb compiler
mov $99, r0
loop:
mov $0, r1
jeq r0, r1, end
<code>
dec r0
jmp loop
end:
[4 instructions overhead]
or:
# non-cc, test copied to bottom
mov $99, r0
# optional mov $0, r1
jeq r0, end # or jeq r0, r1, end
loop:
<code>
dec r0
jne r0, loop # or jne r0, r1, loop
end:
[2 instructions overhead]
or:
mov $99, r0
mov $0, r1
jeq r0, r1, end
loop:
<code>
dec r0
mov $0, r1
jne r0, r1, loop
end:
[3 instructions overhead]
Again, we get 2, 3, or 4 instructions of overhead, with 2/3 occurring
when the compiler moves the test to the bottom and 4 occurring when it
leaves the test at the top.
>2 steps vs. your 4 steps.
Nope. :-)
Now consider modern RISCs, in which pipeline delays are exposed, i.e.,
you get one `free' instruction after a branch. Then:
for (i = 99; i; i--)
code;
can be compiled as:
mov $99, r1 # r0 always = 0
# compiler does not bother to test 99 != 0
loop:
<code>
jne r1, r0, loop # branch if not 0
dec r1 # do this whether we branch or not
# out of the loop now, but r1 = -1;
# if it is used again, we must first set it back to 0.
On the other hand,
for (i = 100; --i > 0;)
code;
can be compiled as:
mov $99, r1
mov $1, r2
loop:
<code>
jgt r1, r2, loop # branch if i > 1
dec r1 # and always decrement i
Both of these loops have exactly two instructions of overhead. (Note
that the replacement of `--i > 0' with `i-- > 1' is legal since the
only case where i-- < 1 but --i > 0 is when i==MININT before the
decrement; i becomes MAXINT after the decrement, and this is an
underflow, and the ANSI standard says all bets are off on underflow.
Additionally, in this case, i is a loop induction variable and is known
to occupy the range [99..1]. RISC compilers tend to do this sort of
optimization.) A good RISC compiler may even choose to transform one
of these loops to the other.
Of course, the best optimization for:
for (i = 1; i < 100; i++)
x += i;
is:
x += 4950;
which will typically be several hundred times faster than anyone's
micro-optimization.
(On another topic entirely:)
> Bear in mind that C is one of the hardest languages in the world to
>optimize.
Not really. Try optimizing ML, or Ada, or . . . .
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA Domain: torek at ee.lbl.gov