VAX loops (was Re: Optimal for loop on the 68020.)
Chris Torek
chris at mimsy.UUCP
Wed Jun 7 18:27:10 AEST 1989
In article <13592 at haddock.ima.isc.com> suitti at haddock.ima.isc.com
(Stephen Uitti) writes:
>The VAX (PCC based) compiler's optimizer would convert many loops
>that used multiple instructions to use a single complicated
>instruction.
It still does.
>Unfortunately, the VAX 780 (ubiquitous at the time) generally ran
>these slower.
I am not sure about `generally', but some do run slower (alas!).
>One case was the acbl (add, compare and branch longword). The
>optimizer would replace the three instructions (something like:)
> add $1,r2
> cmp $END,r2
> bne L1
>
>with the "acbl" and all its arguments. Both codings take the
>same number of bytes (25!).
The original code is
<code>
L1: cmpl END,reg
jgtr L2
<loop body>
addl2 $CONST,reg # CONST must be positive
jbr L1
L2: <more code>
and the replacement is
<code>
decl reg
jbr L2
L1: <loop body>
L2: acbl END,$CONST,reg,L1
<more code>
or, if the loop body is short enough (<= 130 bytes, although c2 uses
the improper metric of `eight instructions') and CONST is 1, the
acbl is replaced with an aobleq.
In neither case is the instruction byte count 25; if we count only
instructions inside the loop, the acbl is always shorter, and in any
case it is never longer. If we assume all branches are in bounds
(which gives the best situation for the cmp/jgtr code), we get:
L1: cmp END,reg # 2+k bytes, k=1 for END constant or reg
bgtr L2 # 2 bytes
<loop body>
addl2 $CONST,reg # 2+l bytes, l=1 for CONST in 0..63
brb L1 # 2 bytes
L2: <more code> # total = 8+k+l (typically 10)
versus:
decl reg # 2 bytes, but outside loop
brb L2 # 2 bytes, but outside loop
L1: <loop body>
L2: acbl END,$CONST,reg,L1 # 1+k+l+1+2 = 4+k+l
<more code> # total = 4+k+l in loop, 8+k+l overall
The most common constant, however, is 1; if the branches are in
bounds, we get an aobleq and the count drops to
L2: aobleq reg,END,L1 # 1+1+k+1 = 3+k
for 3+k bytes in the loop and 7+k bytes total; in this case l=1 and
the cmp/jgtr loop takes 9+k bytes total, all in the loop.
If the branches are not in range, the picture becomes much murkier,
as the `jxxx' opcodes usually `explode', but sometimes `tunnel':
jgtr L3
<code>
<more code>
L3:
can become
jleq Laround # invert the branch
brw L3 # and use a longer form
Laround:<code>
<more code>
L3:
but if there happens to be a branch to L3 (either unconditional or
bgtr) that is in range of the first branch, we might get instead
bgtr Ltunnel # get to someone who gets us there
<code>
Ltunnel:brw L3
<more code>
L3:
>The multiple instructions are faster (on the 780).
This I do not intend to test (780s being rather uninteresting).
Anyway, the real point of all this is that instruction selection,
especially for baroque CISCs like a VAX, can be difficult....
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris at mimsy.umd.edu Path: uunet!mimsy!chris
More information about the Comp.unix.wizards
mailing list