function calls

Sat Mar 17 01:32:26 AEST 1990

In article <14272 at lambda.UUCP> jlg at lambda.UUCP (Jim Giles) writes:

| Careful register-allocation conventions are usually the ones that use the
| registers most greedily.  This is because, in general, the more you use
| the registers, the faster your code goes.  If your "careful" approach to 
| registers is not greedy, how much performance am I losing to it by not
| getting full use out of the hardware available?  Further: what's "an
| adequate supply of registers"?  I know code which can use up as many as
| you give me.  In fact, if interprocedural analysis were commonplace
| instead of rare, you would probably find that the register set was almost
| always completely full (this wouldn't matter since it would be a logical
| consequence of register allocation being done on the program instead of
| a routine at a time).
| 
| These problems may all be solved in the future - even the very near future.
| But at present, only MIPS and Cray (the only ones mentioned anyway) have
| addressed this problem.  And these two 'solutions' rely on the 'callee'
| not using lots of registers and the 'caller' deliberately leaving some
| spare ones - but this, in itself, may have a negative impact on performance.

Umm, the 88k also partitions the register set into caller-save and
callee-save registers.  For the two machines that I'm familiar with, the
breakdown is as follows:

MIPS:	32 integer 32-bit registers
		 1 register hardwired to 0
		 2 registers used for return value and/or static link
		 4 registers used for arguments
		10 registers not preserved across calls
		 9 registers preserved across calls
		 1 register for stack pointer
		 1 register for return address
		 1 register for accessing small static/global data
		 1 register used by the assembler
		 2 registers used by the kernel

	16 floating point 64-bit registers
		 2 registers for return value
		 2 registers for arguments
		 6 registers not preserved across calls
		 6 registers preserved across calls

88K:	32 32-bit registers (double precision takes 2 regs)
		 1 register hardwired to 0
		 8 registers for arguments & return value
		 4 registers not preserved across calls
		13 registers preserved across calls
		 4 registers reserved for linker/OS
		 1 register for the stack pointer
		 1 register for the return address
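
The integer half of the MIPS table can be written down as register
bitmasks, which makes the caller-save/callee-save split easy to reason
about: a value kept in a not-preserved register across a call costs the
caller a save and restore, while a preserved register costs the callee a
save only if the callee actually writes it.  This is a minimal sketch,
assuming the standard MIPS assembler register numbering ($0-$31), which
the table above does not give; all names here are hypothetical.

```python
# Hypothetical encoding of the MIPS integer-register partition above.
# Register numbers are the standard MIPS assembler assignments (an
# assumption: the table itself only gives counts).

def regs(*spans):
    """Build a bitmask from inclusive (lo, hi) register ranges."""
    mask = 0
    for lo, hi in spans:
        for r in range(lo, hi + 1):
            mask |= 1 << r
    return mask

MIPS_INT = {
    "zero":      regs((0, 0)),              # hardwired to 0
    "assembler": regs((1, 1)),              # used by the assembler
    "return":    regs((2, 3)),              # return value / static link
    "args":      regs((4, 7)),              # arguments
    "temps":     regs((8, 15), (24, 25)),   # not preserved across calls
    "saved":     regs((16, 23), (30, 30)),  # preserved across calls
    "kernel":    regs((26, 27)),            # used by the kernel
    "gp":        regs((28, 28)),            # small static/global data
    "sp":        regs((29, 29)),            # stack pointer
    "ra":        regs((31, 31)),            # return address
}

# Registers a caller must assume any call destroys.
CALLER_SAVED = MIPS_INT["return"] | MIPS_INT["args"] | MIPS_INT["temps"]
# Registers a callee must restore before returning if it writes them.
CALLEE_SAVED = MIPS_INT["saved"]

def count(mask):
    """Number of registers in a bitmask."""
    return bin(mask).count("1")

def caller_spills(live_across_call):
    """Live values held in not-preserved registers must be saved by the caller."""
    return count(live_across_call & CALLER_SAVED)

def callee_spills(regs_written):
    """Preserved registers the callee writes must be saved in its prologue."""
    return count(regs_written & CALLEE_SAVED)
```

For example, if $8, $9 and $16 all hold values that are live across a
call, caller_spills(regs((8, 9), (16, 16))) counts 2: only the two
scratch registers need saving by the caller, and $16 is the callee's
responsibility if it uses it.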

Note that the registers used for passing arguments and returning values
are also used as temps.  If the return address is stored on the stack,
the register that holds it can also be used as a temporary.  Neither
architecture requires the use of a frame pointer, though one can easily
be synthesized when it is needed, for example when variable-sized stack
allocations are done.  Finally, the software conventions for both
machines define static tables that describe where registers are saved
on the stack, and which register (and offset from that register) is to
be used as a virtual frame pointer by the library and the debugger.
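
Without a real frame pointer, a debugger or unwinder finds a frame's
saved registers by looking up the procedure in such a table, computing
the virtual frame pointer from a named register plus an offset, and then
locating each saved register relative to it.  The sketch below is a
hypothetical model of that lookup; the descriptor's field names and the
"consecutive words in ascending register order" save layout are
assumptions for illustration, not the actual MIPS or 88k table format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcDesc:
    """Hypothetical per-procedure descriptor (field names are assumptions)."""
    frame_reg: int      # register the virtual frame pointer is based on (e.g. sp)
    frame_offset: int   # vfp = regs[frame_reg] + frame_offset
    reg_mask: int       # bit i set => register i was saved in the prologue
    reg_offset: int     # offset from vfp of the first saved register
    word: int = 4       # 32-bit registers

def virtual_frame_pointer(desc, regs):
    """Compute the virtual frame pointer from the current register values."""
    return regs[desc.frame_reg] + desc.frame_offset

def saved_register_slots(desc, regs):
    """Map each saved register number to its stack address, assuming
    (hypothetically) consecutive words in ascending register order."""
    addr = virtual_frame_pointer(desc, regs) + desc.reg_offset
    slots = {}
    for r in range(32):
        if desc.reg_mask & (1 << r):
            slots[r] = addr
            addr += desc.word
    return slots
```

With sp = 0x1000, a frame offset of 32 and three saved registers starting
12 bytes below the virtual frame pointer, the unwinder would look for $16,
$17 and $31 at 0x1014, 0x1018 and 0x101C, without any frame pointer
register having been reserved.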

The MIPS compilers also have a -O3 option which does global register
allocation.  Here is a fragment of the man page from a DECstation:

          -O3            Perform all optimizations, including global
                         register allocation.  This option must
                         precede all source file arguments.  With this
                         option, a ucode object file is created for
                         each C source file and left in a .u file.
                         The newly created ucode object files, the
                         ucode object files specified on the command
                         line, the runtime startup routine, and all
                         the runtime libraries are ucode linked.
                         Optimization is performed on the resulting
                         ucode linked file and then it is linked as
                         normal producing an a.out file. A resulting
                         .o file is not left from the ucode linked
                         result.  In fact -c cannot be specified with
                         -O3.

          -feedback file Use with the -cord option to specify the
                         feedback file.  This file is produced by with
                         its -feedback option from an execution of the
                         program produced by

          -cord          Run the procedure-rearranger on the resulting
                         file after linking.  The rearrangement is
                         performed to reduce the cache conflicts of
                         the program's text.  The output is left in
                         the file specified by the -o output option or
                         a.out by default.  At least one -feedback
                         file must be specified.
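
To see why procedure order matters for cache conflicts: in a
direct-mapped instruction cache, two hot procedures whose addresses map
to the same cache lines keep evicting each other, while laying the hot
procedures out next to each other avoids that as long as they fit in the
cache.  The sketch below assumes a hypothetical 16 KB direct-mapped cache
with 16-byte lines and uses a simple "most frequently executed first"
ordering as a stand-in; it is not the algorithm -cord actually uses.

```python
def cache_sets(start, size, line=16, nsets=1024):
    """Cache set indices touched by code in [start, start + size)."""
    first = start // line
    last = (start + size - 1) // line
    return {blk % nsets for blk in range(first, last + 1)}

def conflicts(layout, hot, line=16, nsets=1024):
    """Pairs of hot procedures whose code shares at least one cache set.
    layout maps procedure name -> (start address, size in bytes)."""
    hot_list = [p for p in layout if p in hot]
    out = []
    for i, a in enumerate(hot_list):
        for b in hot_list[i + 1:]:
            sa = cache_sets(*layout[a], line, nsets)
            sb = cache_sets(*layout[b], line, nsets)
            if sa & sb:
                out.append((a, b))
    return out

def hot_first_layout(sizes, counts, base=0):
    """Place procedures contiguously, most frequently executed first
    (counts would come from a feedback run)."""
    layout = {}
    addr = base
    for name in sorted(sizes, key=lambda n: -counts.get(n, 0)):
        layout[name] = (addr, sizes[name])
        addr += sizes[name]
    return layout
```

If a large cold procedure sits between two small hot ones and happens to
be exactly one cache size long, the hot pair collides on every call;
reordering by execution count puts them in adjacent, non-overlapping sets.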

Because -c cannot be used with -O3, I'm not sure how many people use
this in practice for large software.  I would imagine that for programs
which use runtime binding (e.g. emacs, or C++ code with virtual
functions), it would fall back to the standard calling sequence.  I
wonder how much it buys for typical software as opposed to special
cases.
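
A rough model of what whole-program register allocation can buy, and why
runtime binding limits it: if the allocator knows which registers each
procedure, and everything it calls, actually writes, a caller only has to
save live values in those registers.  When the callee is unknown (a call
through a pointer or a C++ virtual function), it must assume the standard
convention's not-preserved set.  This is a hypothetical sketch, not a
description of what the MIPS ucode linker does; all names are made up.

```python
def transitive_clobbers(direct, calls, indirect_callers, conventional):
    """
    direct:           proc -> bitmask of registers the proc itself writes
    calls:            proc -> list of procs it calls directly
    indirect_callers: procs that also call through pointers (callee unknown)
    conventional:     mask an unknown callee may clobber under the convention
    Returns proc -> bitmask of registers clobbered by a call to proc.
    """
    clob = {p: direct[p] | (conventional if p in indirect_callers else 0)
            for p in direct}
    changed = True
    while changed:                      # iterate to a fixed point over the call graph
        changed = False
        for p in clob:
            new = clob[p]
            for q in calls.get(p, ()):
                new |= clob[q]
            if new != clob[p]:
                clob[p] = new
                changed = True
    return clob

def call_site_saves(live, callee, clob, conventional):
    """Registers the caller must save around a call; callee=None means unknown."""
    mask = conventional if callee is None else clob[callee]
    return bin(live & mask).count("1")
```

Note that a single indirect call anywhere below a procedure pushes the
conventional mask up through all of its callers, which is one reason the
gain for code full of runtime binding may be small.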
--
Michael Meissner	email: meissner at osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so