VAX loops (was Re: Optimal for loop on the 68020.)

Wed Jun 7 18:27:10 AEST 1989

In article <13592 at haddock.ima.isc.com> suitti at haddock.ima.isc.com
(Stephen Uitti) writes:
>The VAX (PCC based) compiler's optimizer would convert many loops
>that used multiple instructions to use a single complicated
>instruction.

It still does.

>Unfortunately, the VAX 780 (ubiquitous at the time) generally ran
>these slower.

I am not sure about `generally', but some do run slower (alas!).

>One case was the acbl (add, compare and branch longword).  The
>optimizer would replace the three instructions (something like:)
>	add $1,r2
>	cmp $END,r2
>	bne L1
>
>with the "acbl" and all its arguments.  Both codings take the
>same number of bytes (25!).

The original code is

	<code>
L1:	cmpl	END,reg
	jgtr	L2
	<loop body>
	addl2	$CONST,reg	# CONST must be positive
	jbr	L1
L2:	<more code>

and the replacement is

	<code>
	decl	reg
	jbr	L2
L1:	<loop body>
L2:	acbl	END,$CONST,reg,L1
	<more code>

or, if the loop body is short enough (<= 130 bytes, although c2 uses
the improper metric of `eight instructions') and CONST is 1, the
acbl is replaced with an aobleq.

In neither case is the instruction byte count 25; if we count only
instructions inside the loop, the acbl is always shorter, and in any
case it is never longer.  If we assume all branches are in bounds
(which gives the best situation for the cmp/jgtr code), we get:

L1:	cmp	END,reg		# 2+k bytes, k=1 for END constant or reg
	bgtr	L2		# 2 bytes
	<loop body>
	addl2	$CONST,reg	# 2+l bytes, l=1 for CONST in 0..63
	brb	L1		# 2 bytes
L2:	<more code>		# total = 8+k+l (typically 10)

versus:

	decl	reg		# 2 bytes, but outside loop
	brb	L2		# 2 bytes, but outside loop
L1:	<loop body>
L2:	acbl	END,$CONST,reg,L1 # 1+k+l+1+2 = 4+k+l
	<more code>		# total = 4+k+l in loop, 8+k+l overall

The most common constant, however, is 1; if the branches are in
bounds, we get an aobleq and the count drops to

L2:	aobleq	reg,END,L1	# 1+1+k+1 = 3+k

for 3+k bytes in the loop and 7+k bytes total; in this case l=1 and
the cmp/jgtr loop takes 9+k bytes total, all in the loop.

If the branches are not in range, the picture becomes much murkier,
as the `jxxx' opcodes usually `explode', but sometimes `tunnel':

		jgtr	L3
		<code>
		<more code>
	L3:

can become

		jleq	Laround		# invert the branch
		brw	L3		# and use a longer form
	Laround:<code>
		<more code>
	L3:

but if there happens to be a branch to L3 (either unconditional or
bgtr) that is in range of the first branch, we might get instead

		bgtr	Ltunnel		# get to someone who gets us there
		<code>
	Ltunnel:brw	L3
		<more code>
	L3:

>The multiple instructions are faster (on the 780).

This I do not intend to test (780s being rather uninteresting).

Anyway, the real point of all this is that instruction selection,
especially for baroque CISCs like a VAX, can be difficult....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris at mimsy.umd.edu	Path:	uunet!mimsy!chris