alloca wars

Sun Aug 7 08:41:47 AEST 1988

> > 2.)  Alloca is less available than malloc/free.
> On a M68k with a decent OS, alloca is not more than a few lines of assembly
> code, correct?  (Judging by the size of the GNU assembly alloca)...
> If one needs alloca and it is not available, why not write a quick alloca?
> If the machine architecture or OS is braindamaged, then one would have
> to program around it.  GNU does that also.

If the compiler knows about alloca [Read:  recognizes the name as special 
and refrains from doing certain things because of it, OR recognizes the name 
as special and implements it as a built-in], , it could be made to work 
nicely.  If the compiler doesn't know about alloca you can have a real 
headache trying to implement it as it was intended (memory on the stack 
frame), even with a "nice" CPU.  

Further, even with the ultimate alloca implementation, GNU GCC with
alloca built into the compiler, it still gets it wrong.

Take, for example, the following 68000 linkage conventions:

d0-d1, a0-a1 are clobbered across calls.  d2-d7, a2-a5 are preserved across 
calls.  a6 is the frame pointer.  a7 is the stack pointer.  Args are
on the stack, first arg at the lowest address.  CALLER POPS ARGS FROM THE 
STACK.  Function return value is in d0 or d0/d1 (for doubles).

These are not negotiable, unless I scrap all my libraries and the compiler
that came with the system, and actually seem fairly reasonable.  These are 
used on Tandy's 68k Xenix compiler, and some other systems, like the sun 
(2 and/or 3) version of the GNU compiler, which I have adapted to also
work on a Tandy 6000.  

Ok, so alloca should be simple, right?  
	- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
	- Pop the return address into a register (a0 probably). [1 instruction]
	- Pick up the requested amount of memory and round it up to
	  a multiple of 2. [2-3 instructions]
	- Subtract the rounded amount of memory from the stack pointer.
	  [1 instruction]
	- Put the stack pointer + the size of the argument ( = 4) in d0
	  (because the caller is going to adjust the stack to remove
	  that argument). [1-2 instructions]
	- Return to the saved return address. [1 instruction]

[Nitpickers:  I really don't care if I'm one or two instructions off in the
above estimates]

Simple, huh?  So how come bison infinite loops?

Typical function entry/exit code for Tandy's compiler:
	.globl _foo
_foo:
	link	a6,#-<stack frame size>
	moveml	#<<bunch of registers>>,-(a7)
	...
	moveml	(a7)+,#<<bunch of registers>>		| AARRRGGGGHHHH!!!!
	unlk	a6
	rts
The "bunch of registers" is determined from what is actually used.  There
are variations of this code if 0 or 1 registers need to be saved.

Notice the instruction marked "AARRRGGGGHHHH".  If alloca moves the stack 
pointer, then it has to allocate some extra memory and leave a copy of the 
saved registers in them, so this instruction can pop them off.  (Why didn't 
the compiler writer use "moveml <offset>(a6),#<bunch of registers>" instead?  
I don't care, really, but it might be because the code it actually generates 
is 2 bytes shorter (and 1 clock cycle faster on a 68020).)  So, my first shot 
at alloca trashes all of the registers in the caller of the caller of alloca.

Ok, how much memory?  40 bytes (d2-d7 and a2-a5), regardless of whether the 
function actually saves them or not.  Why not find out what's actually saved?  
To do this, you'd need to find the entry point of the function and disassemble 
a few instructions.  The frame pointer points to the return point in the 
caller of the caller of alloca.  However, in 68000 code, given the address 
of the end of an instruction and the contents of memory, you cannot always
uniquely find the address of the beginning of it.  (Ok, maybe 99% of the time 
you can, but having any need for a "panic:  alloca can't find it's caller - 
program aborted" message is unacceptable).  Besides, if the caller of alloca 
was called through a function pointer (I DID NOT say alloca was called through 
a function pointer!  although that wouldn't matter much anyway), the address 
might be and usually is in a0, and has been destroyed before alloca is called.  

Now, on to another issue: function calls and arguments.  Assume that the 
compiler has a maximum of N temporary locations in registers to store 
computed arguments that haven't been pushed on the stack yet.  (Suppose for 
this example that N = 2).  Assume that the compiler pushes its arguments onto 
the stack in an endian order (I did NOT say which end!).  The arguments and 
return types of foo and bar are (char *), and bar returns its first argument.  
How does the compiler evaluate this (foo is presumed to have 2N+3 arguments):

	foo(bar(alloca(20)), bar(alloca(20)), bar(alloca(20)), alloca(40),
		bar(alloca(20)), bar(alloca(20)), bar(alloca(20)) );

Can anyone find ANY compiler on a system with caller-pops-args linkage
conventions that gets this right?  Or for that matter, any compiler on
ANY system where alloca really allocates memory on the stack?

To correctly evaluate this, the compiler would have to evaluate 
"bar(alloca(20))" 3 times, store the results somewhere other than pushing
on the stack, and since it has only 2 registers, one of them has to be 
elsewhere (stack frame relative temporary seems to be the only place left), 
evaluate alloca(40), at which point it had better not have pushed anything on 
the stack, and then evaluate "bar(alloca(20))" 3 more times.  THEN it fetches 
the 7 values from assorted storage and pushes them on the stack.

Now, consider a compiler that doesn't know alloca is special.  Can you
imagine the laughter if it always generated the above sequence for
every function, alloca or not?  Storing stuff on the stack frame, then
copying it with a push-contents-of-memory-on-stack instruction?  Inefficient
code in the extreme ...  .  

What it's really going to do is evaluate "bar(alloca(20))", push the result on 
the stack, repeat 2 more times, evaluate "alloca(40)", push the result on 
the stack, and repeat evaluating and pushing "bar(alloca(20))" 3 times.  
foo() will then be called with all but the last-pushed argument in the wrong 
place.  This is, in fact, how gcc 1.24 evaluates it, WITH AND WITHOUT 
using the builtin alloca.  With the builtin and without the optimizer, the 
alloca() becomes 2 instructions (fix stack and move return value to d0).
The round-up-to-multiple-of-2 gets done at compile time since the arguments
to the alloca calls were constant.

Consider the GNU compiler feature "defer-pop".  If a compiler defers
taking arguments off the stack after a function call, and doesn't recognize 
alloca as a clue to turn off this behavior, and there is no flag to
control it, then I have to throw up my hands and abandon any attempt to
write alloca.  Supposedly GCC does recognize alloca as special for this
purpose, and in any case, it has flags to control it.

Back to my Tandy 68k machine.  To accomodate the args-on-the-stack problem, 
I have to make alloca copy more of the stack frame to handle the worst-case 
of args pushed on the stack when it's called.  There is no worst-case, but 5 
args ( = 20 more bytes) might be good enough.

I got lucky.  Tandy's compiler references auto variables relative to the
stack frame pointer.  If it referenced them relative to the stack pointer,
I'd be out of luck.

So now it works like this:
	- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
	- Pop the return address into a register (a0 probably). [1 instruction]
	- Pick up the requested amount of memory and round it up to
	  a multiple of 2, and add 60. [2-3 instructions]
	- Subtract the rounded amount of memory from the stack pointer.
	  [1 instruction]
	- Copy 60 bytes from the old stack pointer + 4 to the new stack
	  pointer + 4 for length 60.  [15 instructions, straightline.
	  Could be less but slower with a loop]
	- Put the stack pointer + the size of the argument plus the
	  copied stack ( = 64) in d0 (because the caller is going to adjust 
	  the stack to remove that argument). [1-2 instructions]
	- Return to the saved return address. [1 instruction]

So I have a version of alloca that allocates its memory on the stack frame
and wastes 60 bytes per call.  If a compiler uses defer-pop, it's 
going to screw up.  If there are more than 5 arguments on the stack when
it's called, it's going to screw up.  But bison doesn't infinite loop
any more.

Why don't I post it?  Several reasons:  
- It's too compiler- and system- specific to be of much use to many people.
- The waste of memory per call is embarrasing.
- It was debugged using bison, and according to discussions going on in 
  gnu.gcc, it is probably subject to the GNU Public License, and I'd have to 
  post the source to everything I used it in.  I don't have that much disk 
  space, and neither does the net.
- It still doesn't work right, in ways I indicated.

				Gordon L. Burditt
				killer!ninja!sneaky!gordon