micro-optimizing loops (was Help with casts)
Chris Torek
torek at elf.ee.lbl.gov
Sun Feb 24 08:37:25 AEST 1991
[the original loop:
for (i = 0; i < 100; i++)
x += i;
]
In article <409 at ceco.ceco.com> garry at ceco.ceco.com (Garry Garrett)
suggests rewriting the loop as
>>> for (i=99; i; i--)
>>> x += i;
>In article <339 at smds.UUCP> rh at smds.UUCP (Richard Harter) suggests instead:
>> for (i=100;--i>=0;)
>> x += i;
In article <414 at ceco.ceco.com> garry at ceco.ceco.com (Garry Garrett) writes:
> I must disagree with you. For starters, consider the situation where
>i == 0. Your loop adds 0 to x.
(True; we could use
for (i = 100; --i > 0;)
x += i;
instead, or equivalently `--i != 0', etc.)
>Secondly, as my discussion that you deleted pointed out, performing
>"--i>=0" would take more time than "i"
`No it won't! Yes it will! No it won't! Yes it will!' :-)
>The reason is that both i and 0 must be loaded into registers,
>compared, and a branch instruction executed for your version. My version
>loads i into a register and performs a branch.
This turns out not to be the case.
There are a number of common `dumb' implementations for the two loops
for (i = 100; --i > 0;) /* loop 1 */
x += i;
and
for (i = 99; i; i--) /* loop 2 */
x += i;
and they all depend on the compiler's and machine's characteristics.
(Note: for the discussion below I have assumed that `i' goes in a
machine register. If not, the picture becomes even more complicated,
as the loops then depend on which ALU operations, if any, are permitted
on memory operands.)
Most currently-popular machines can be placed in exactly one of these
two categories: `has condition codes' or `does not have condition
codes'. The former can be further divided into `has condition codes
that are or can be set on arithmetic operations such as decrement' and
`has condition codes that can only be set by test instructions'. The
latter can be divided into `has branch on positive instruction' and
`requires comparison against register'.
(I am deliberately leaving modern RISCs out of the first part of this
article.)
On the condition-code machines, the first loop tends to compile to one of:
# cc machine, compiler leaves test at top
mov $100, r0 # i = 100
loop:
dec r0 # decrement and set condition codes
jle end # exit loop if i <= 0
<code> # x += i, or whatever
jmp loop # continue loop
end:
[3 instructions overhead]
or:
mov $100, r0
loop:
dec r0 # decrement but do not set cc's
tst r0 # set cc's
jle end
<code>
jmp loop
end:
[4 instructions overhead]
or:
# cc machine, compiler copies test to bottom
mov $100, r0
dec r0
# optional: tst r0
jle end
loop:
<code>
dec r0
# optional: tst r0
jgt loop
end:
[2 or 3 instructions overhead]
Thus, the loop itself contains either three or four `overhead' instructions
if the compiler does not copy the test to the bottom, but only two or three
if it does. If the compiler is smarter, it also notices that i=100,--i
gives i=99, which is > 0, which means the top `jle' is unnecessary and
the `mov $100, r0; dec r0' sequence can be simplified to `mov $99, r0'.
# non-cc, branch on 0-compare; test left at top
mov $100, r0
loop:
dec r0
jle r0, end # jump if r0 <= 0
<code>
jmp loop
end:
[3 instructions overhead]
or:
# non-cc, branch on reg-compare; test left at top
mov $100, r0
mov $0, r1
loop:
dec r0
jle r0, r1, end # jump if r0 <= r1
<code>
jmp loop
end:
[3 instructions overhead]
or (if the compiler is seriously stupid):
mov $100, r0
loop:
dec r0
mov $0, r1
jle r0, r1, end
<code>
jmp loop
end:
[4 instructions overhead]
or:
# non-cc, branch on 0 compare; test copied to bottom:
mov $100, r0
dec r0
jle r0, end
loop:
<code>
dec r0
jgt r0, loop
end:
[2 instructions overhead]
or the same with the `mov $0, r1' added and the branches changed to
`jxx r0, r1, end|loop' [3 instructions overhead].
Thus, the compiler tends to emit 2, 3, or 4 instructions of loop
overhead, but you only get 4 if the compiler does not copy the
test to the bottom of the loop.
For the second loop [for (i = 99; i; i--)], we get one of these:
# cc machine, does not matter whether arithmetic sets cc's;
# compiler leaves test at top
mov $99, r0
loop:
tst r0 # set condition codes
jeq end # exit loop if i==0
<code>
dec r0
jmp loop
end:
[4 instructions of overhead]
or:
# cc machine; compiler copies test to bottom
mov $99, r0
tst r0
jeq end
loop:
<code>
dec r0
# optional tst r0
jne loop # repeat if not 0
end:
[2 or 3 instructions of overhead]
On the non-cc machine, we get one of:
# non-cc, test left at top
mov $99, r0
# optional mov $0, r1
loop:
jeq r0, end # or jeq r0, r1, end
<code>
dec r0
jmp loop
end:
[3 instructions overhead]
or:
# non-cc, test left at top, really dumb compiler
mov $99, r0
loop:
mov $0, r1
jeq r0, r1, end
<code>
dec r0
jmp loop
end:
[4 instructions overhead]
or:
# non-cc, test copied to bottom
mov $99, r0
# optional mov $0, r1
jeq r0, end # or jeq r0, r1, end
loop:
<code>
dec r0
jne r0, loop # or jne r0, r1, loop
end:
[2 instructions overhead]
or:
mov $99, r0
mov $0, r1
jeq r0, r1, end
loop:
<code>
dec r0
mov $0, r1
jne r0, r1, loop
end:
[3 instructions overhead]
Again, we get 2, 3, or 4 instructions of overhead, with 2/3 occurring
when the compiler moves the test to the bottom and 4 occurring when it
leaves the test at the top.
>2 steps vs. your 4 steps.
Nope. :-)
Now consider modern RISCs, in which pipeline delays are exposed, i.e.,
you get one `free' instruction after a branch. Then:
for (i = 99; i; i--)
code;
can be compiled as:
mov $99, r1 # r0 always = 0
# compiler does not bother to test 99 != 0
loop:
<code>
jne r1, r0, loop # branch if not 0
dec r1 # do this whether we branch or not
# out of the loop now, but r1 = -1;
# if it is used again, we must first set it back to 0.
On the other hand,
for (i = 100; --i > 0;)
code;
can be compiled as:
mov $99, r1
mov $1, r2
loop:
<code>
jgt r1, r2, loop # branch if i > 1
dec r1 # and always decrement i
Both of these loops have exactly two instructions of overhead. (Note
that the replacement of `--i > 0' with `i-- > 1' is legal since the
only case where i-- < 1 but --i > 0 is when i==MININT before the
decrement; i becomes MAXINT after the decrement, and this is an
underflow, and the ANSI standard says all bets are off on underflow.
Additionally, in this case, i is a loop induction variable and is known
to occupy the range [99..1]. RISC compilers tend to do this sort of
optimization.) A good RISC compiler may even choose to transform one
of these loops to the other.
Of course, the best optimization for:
for (i = 1; i < 100; i++)
x += i;
is:
x += 4950;
which will typically be several hundred times faster than anyone's
micro-optimization.
(On another topic entirely:)
> Bear in mind that C is one of the hardest languages in the world to
>optimize.
Not really. Try optimizing ML, or Ada, or . . . .
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA Domain: torek at ee.lbl.gov